Intro to AI alternate Assignment

This is an alternative to being in class on 4/19, when I will be off at a conference. It is intended to firm up your understanding of a small aspect of computational linguistics while giving you a little hands-on practice. The entire exercise should take no more than the standard 2 hours and 45 minutes of class.

You will turn in a printout of the text file that you generate while doing this assignment. Turn it in at the beginning of the next class.

When we left off last time, we were discussing n-gram models of part-of-speech tagging. We'll look at those a little more in this lesson with some hands-on practice.

We will use the Natural Language Toolkit, which is available for download. Go to the installation link, then download and install it for your system. You need not install WordNet if you are given the opportunity to do so, but you should definitely install the corpora.

Once you have everything installed, run Python as usual and import the part-of-speech tagging modules along with some other useful ones.

from nltk_lite.tag import *
from nltk_lite.corpora import brown, extract
from nltk_lite.probability import FreqDist

from itertools import islice
from nltk_lite import tag
from nltk_lite.probability import ConditionalFreqDist

This might take a second or two. (Or you can put those imports at the top of a file and try out the exercises below there.)

Next read through the NLTK documentation chapter on part of speech tagging.

We'll go through some of the exercises there.

Let's find out what the 21st through 30th most common verbs are in one of the corpora, using a slight variation on the code included in the demo. Try this out:

def getSecondFreqVerbs():
    fd = FreqDist()
    for sent in brown.tagged():
        for word, tag in sent:
            if tag[:2] == 'vb':
                fd.inc(word)     # count every verb occurrence (vb, vbd, vbg, ...)
    print fd.sorted_samples()[20:30]

Run the function and keep track of the output (put it in your text file and annotate it so I know what it is).
Now you've found some of the common verbs in that corpus.
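If you want a concept check outside of NLTK, the same frequency count can be sketched in plain Python with collections.Counter. The tiny tagged corpus below is made up for illustration; it just stands in for brown.tagged():

```python
from collections import Counter

# A toy stand-in for brown.tagged(): a list of tagged sentences,
# each a list of (word, tag) pairs with lowercase Brown-style tags.
tagged_sents = [
    [('the', 'at'), ('dog', 'nn'), ('ran', 'vbd')],
    [('she', 'pps'), ('ran', 'vbd'), ('and', 'cc'), ('jumped', 'vbd')],
    [('they', 'ppss'), ('run', 'vb'), ('fast', 'rb')],
]

fd = Counter()
for sent in tagged_sents:
    for word, tag in sent:
        if tag[:2] == 'vb':   # any verb tag: vb, vbd, vbg, ...
            fd[word] += 1

# most_common() plays the role of sorted_samples() here
print(fd.most_common())       # -> [('ran', 2), ('jumped', 1), ('run', 1)]
```

The slice `[20:30]` in the assignment code works the same way on this list: it skips the 20 most frequent verbs and keeps the next ten.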

Next, follow the directions for section 4.4, except rather than trying the 100 most common words, use the 200 most common words. Show the output from your run in your text file and note how much more accurate the tagger is with the 200 most common words. What return do you get for doubling your word coverage? How much better is it?
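If you're unsure what the section 4.4 lookup tagger is doing under the hood, here is a minimal plain-Python sketch of the idea: remember the most frequent tag for each of the N most common words, and fall back to a default tag for everything else. The toy training data, N=3, and the choice of 'nn' as the default are my assumptions for illustration, not part of the assignment:

```python
from collections import Counter, defaultdict

# Toy training data standing in for the Brown corpus.
train = [
    [('the', 'at'), ('cat', 'nn'), ('saw', 'vbd'), ('the', 'at'), ('dog', 'nn')],
    [('the', 'at'), ('dog', 'nn'), ('saw', 'vbd'), ('a', 'at'), ('cat', 'nn')],
]

# 1. Count how often each word occurs, and with which tags.
tag_counts = defaultdict(Counter)
word_freq = Counter()
for sent in train:
    for word, tag in sent:
        tag_counts[word][tag] += 1
        word_freq[word] += 1

# 2. For the N most frequent words, store each word's most common tag.
N = 3
lookup = {w: tag_counts[w].most_common(1)[0][0]
          for w, _ in word_freq.most_common(N)}

# 3. Tag new text: use the table, back off to a default tag otherwise.
def lookup_tag(words, default='nn'):
    return [(w, lookup.get(w, default)) for w in words]

print(lookup_tag(['the', 'dog', 'barked']))
```

Doubling N grows the table, but each extra word is rarer than the last, so coverage (and accuracy) improves by less and less. That diminishing return is exactly what the exercise asks you to measure.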

Finally, let's take a look at the difference between bigram and unigram tagging. Follow the exercises in section 4.6, but when doing the exercises in 4.6.3 (which is the one I want you to show the output from in the file you print and turn in), train your taggers on section 'k' (general fiction) and then test them on section 'l' (mystery fiction). How did this change the accuracy? Now retest the taggers on section 'h' (government press releases). How did your part-of-speech tagger fare when it was tested on a different genre than it was trained on? Comment on this in your text file.
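Before you run the 4.6 exercises, here is a hedged plain-Python sketch of the core difference: a unigram tagger looks only at the word itself, while a bigram tagger also conditions on the previous tag. The toy data (built around the noun/verb ambiguity of "duck") and the backoff choices are mine, not the textbook's:

```python
from collections import Counter, defaultdict

train = [
    [('the', 'at'), ('duck', 'nn'), ('swam', 'vbd')],
    [('they', 'ppss'), ('duck', 'vb'), ('quickly', 'rb')],
    [('the', 'at'), ('duck', 'nn'), ('ate', 'vbd')],
]

# Unigram counts: tag frequencies per word.
uni = defaultdict(Counter)
# Bigram counts: tag frequencies per (previous tag, word) context.
bi = defaultdict(Counter)
for sent in train:
    prev = '<s>'                 # sentence-start marker
    for word, tag in sent:
        uni[word][tag] += 1
        bi[(prev, word)][tag] += 1
        prev = tag

def unigram_tag(words):
    # Most frequent tag for each word, 'nn' for unknown words.
    return [uni[w].most_common(1)[0][0] if w in uni else 'nn' for w in words]

def bigram_tag(words):
    tags, prev = [], '<s>'
    for w in words:
        if (prev, w) in bi:      # context seen in training
            t = bi[(prev, w)].most_common(1)[0][0]
        elif w in uni:           # back off to the unigram model
            t = uni[w].most_common(1)[0][0]
        else:
            t = 'nn'
        tags.append(t)
        prev = t
    return list(zip(words, tags))

# 'duck' is ambiguous: the unigram tagger always picks its overall most
# frequent tag ('nn'), while the bigram tagger can use the context.
print(unigram_tag(['they', 'duck']))   # -> ['ppss', 'nn']
print(bigram_tag(['they', 'duck']))    # -> [('they', 'ppss'), ('duck', 'vb')]
```

This also hints at why accuracy drops across genres: a bigram tagger depends on having seen its (previous tag, word) contexts in training, so unseen contexts force it to back off more often.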

And that's it. Hopefully this hands-on work has helped you understand a bit more about the everyday tools used in some natural language processing work. Print your file with the output from the three relevant sections above and bring it to class next time.