Intro to AI alternate Assignment

This is an alternative to being in class on 4/19, when I will be off at a conference. It is intended to firm up your understanding of a small aspect of computational linguistics while giving you a little hands-on practice. The entire exercise should take no more than the standard 2 hours and 45 minutes of class.

You will turn in a printout of the text file that you generate while doing this assignment. Turn it in at the beginning of the next class.

When we left off last time, we were discussing n-gram models of part-of-speech tagging. We'll look at those a little more in this lesson with some hands-on practice.

We will use the Natural Language Toolkit, which is available for download. Go to the installation link, then download and install it for your system. You need not install WordNet if you are given the opportunity to do so, but you should definitely install the corpora.

Once you have everything installed, run Python as usual and import the part-of-speech tagging modules along with some other useful ones.

from nltk_lite.tag import *
from nltk_lite.corpora import brown, extract
from nltk_lite.probability import FreqDist

from itertools import islice
from nltk_lite import tag
from nltk_lite.probability import ConditionalFreqDist

This might take a second or two. (Or you can put those imports at the top of a file and try out the exercises below there.)

Next read through the NLTK documentation chapter on part of speech tagging.

We'll go through some of the exercises there.

Let's find out what the 21st through 30th most common verbs are in one of the corpora, using a slight variation on the code included in the demo. Try this out:

def getSecondFreqVerbs():
    fd = FreqDist()
    for sent in brown.tagged():
        for word, tag in sent:
            if tag[:2] == 'vb':
                fd.inc(word)     # count every verb occurrence (vb, vbd, vbg, ...)
    print fd.sorted_samples()[20:30]

Run the function and keep track of the output (put it in your text file and annotate it so I know what it is).
Now you've found some of the common verbs in that corpus.
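If you want a concept check outside of NLTK, the same frequency count can be sketched in plain Python with collections.Counter. The tiny tagged corpus below is made up for illustration; it just stands in for brown.tagged():

```python
from collections import Counter

# A toy stand-in for brown.tagged(): a list of tagged sentences,
# each a list of (word, tag) pairs with lowercase Brown-style tags.
tagged_sents = [
    [('the', 'at'), ('dog', 'nn'), ('ran', 'vbd')],
    [('she', 'pps'), ('ran', 'vbd'), ('and', 'cc'), ('jumped', 'vbd')],
    [('they', 'ppss'), ('run', 'vb'), ('fast', 'rb')],
]

fd = Counter()
for sent in tagged_sents:
    for word, tag in sent:
        if tag[:2] == 'vb':   # any verb tag: vb, vbd, vbg, ...
            fd[word] += 1

# most_common() plays the role of sorted_samples() here
print(fd.most_common())       # -> [('ran', 2), ('jumped', 1), ('run', 1)]
```

The slice `[20:30]` in the assignment code works the same way on this list: it skips the 20 most frequent verbs and keeps the next ten.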

Next, follow the directions for section 4.4, except rather than trying the 100 most common words, use the 200 most common words. Show the output from your run in your text file and note how much more accurate the tagger is with the 200 most common words. What return do you get for doubling your word coverage? How much better is it?
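If you're unsure what the section 4.4 lookup tagger is doing under the hood, here is a minimal plain-Python sketch of the idea: remember the most frequent tag for each of the N most common words, and fall back to a default tag for everything else. The toy training data, N=3, and the choice of 'nn' as the default are my assumptions for illustration, not part of the assignment:

```python
from collections import Counter, defaultdict

# Toy training data standing in for the Brown corpus.
train = [
    [('the', 'at'), ('cat', 'nn'), ('saw', 'vbd'), ('the', 'at'), ('dog', 'nn')],
    [('the', 'at'), ('dog', 'nn'), ('saw', 'vbd'), ('a', 'at'), ('cat', 'nn')],
]

# 1. Count how often each word occurs, and with which tags.
tag_counts = defaultdict(Counter)
word_freq = Counter()
for sent in train:
    for word, tag in sent:
        tag_counts[word][tag] += 1
        word_freq[word] += 1

# 2. For the N most frequent words, store each word's most common tag.
N = 3
lookup = {w: tag_counts[w].most_common(1)[0][0]
          for w, _ in word_freq.most_common(N)}

# 3. Tag new text: use the table, back off to a default tag otherwise.
def lookup_tag(words, default='nn'):
    return [(w, lookup.get(w, default)) for w in words]

print(lookup_tag(['the', 'dog', 'barked']))
```

Doubling N grows the table, but each extra word is rarer than the last, so coverage (and accuracy) improves by less and less. That diminishing return is exactly what the exercise asks you to measure.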

Finally, let's take a look at the difference between bigram and unigram tagging. Follow the exercises in section 4.6, but when doing the exercises in 4.6.3 (which is the one I want you to show the output from in the file you print and turn in), train your taggers on section 'k' (general fiction) and then test them on section 'l' (mystery fiction). How did this change the accuracy? Now retest the taggers on section 'h' (government press releases). How did your part-of-speech tagger fare when it was tested on a different genre than it was trained on? Comment on this in your text file.
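Before you run the 4.6 exercises, here is a hedged plain-Python sketch of the core difference: a unigram tagger looks only at the word itself, while a bigram tagger also conditions on the previous tag. The toy data (built around the noun/verb ambiguity of "duck") and the backoff choices are mine, not the textbook's:

```python
from collections import Counter, defaultdict

train = [
    [('the', 'at'), ('duck', 'nn'), ('swam', 'vbd')],
    [('they', 'ppss'), ('duck', 'vb'), ('quickly', 'rb')],
    [('the', 'at'), ('duck', 'nn'), ('ate', 'vbd')],
]

# Unigram counts: tag frequencies per word.
uni = defaultdict(Counter)
# Bigram counts: tag frequencies per (previous tag, word) context.
bi = defaultdict(Counter)
for sent in train:
    prev = '<s>'                 # sentence-start marker
    for word, tag in sent:
        uni[word][tag] += 1
        bi[(prev, word)][tag] += 1
        prev = tag

def unigram_tag(words):
    # Most frequent tag for each word, 'nn' for unknown words.
    return [uni[w].most_common(1)[0][0] if w in uni else 'nn' for w in words]

def bigram_tag(words):
    tags, prev = [], '<s>'
    for w in words:
        if (prev, w) in bi:      # context seen in training
            t = bi[(prev, w)].most_common(1)[0][0]
        elif w in uni:           # back off to the unigram model
            t = uni[w].most_common(1)[0][0]
        else:
            t = 'nn'
        tags.append(t)
        prev = t
    return list(zip(words, tags))

# 'duck' is ambiguous: the unigram tagger always picks its overall most
# frequent tag ('nn'), while the bigram tagger can use the context.
print(unigram_tag(['they', 'duck']))   # -> ['ppss', 'nn']
print(bigram_tag(['they', 'duck']))    # -> [('they', 'ppss'), ('duck', 'vb')]
```

This also hints at why accuracy drops across genres: a bigram tagger depends on having seen its (previous tag, word) contexts in training, so unseen contexts force it to back off more often.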

And that's it. Hopefully this hands-on work has helped you understand a bit more about the everyday tools used in some natural language processing work. Print your file with the output from the three relevant sections above and bring it to class next time.