Unsupervised Part Of Speech Lexicon Induction Output

This page distributes unsupervised POS lexicon induction output for English. The output was generated by the system introduced in the following paper:

Unsupervised Part-of-Speech Acquisition for Resource-Scarce Languages.
Sajib Dasgupta and Vincent Ng.
In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Prague, 2007.

Please see my thesis for more details:
Toward Language Independent Morphological Segmentation and Part-of-speech Induction
Advisor: Vincent Ng, University of Texas at Dallas.

Here are the files:

WSJ Frequency>=5 : Size 8.3K word types. POS output for the words that occur in WSJ with frequency >=5. Note that all words are lowercased.

WSJ All : Size 20.5K word types. POS output for all words in WSJ except those whose feature vectors are too sparse (no more than one feature is present). See Section 3.8 of my thesis for more details.

More : Size 26.2K word types. This file contains more words than the WSJ files. Our unsupervised analyzer generates words in the subcluster formation step, and some of them do not appear in WSJ.

Cluster Information : List of open class POS tags induced by our clustering system.
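
The two filters mentioned in the file descriptions above (a frequency threshold of 5 and the exclusion of words with no more than one feature present) can be sketched as follows. This is only an illustrative sketch, not the code used to produce the files. The function names, the choice to lowercase tokens before counting, and the representation of a feature vector as a plain sequence of numbers are all assumptions; the actual features are described in the thesis.

```python
from collections import Counter


def frequent_types(tokens, min_freq=5):
    """Return the lowercased word types that occur at least min_freq times.

    Assumption: tokens are lowercased before counting.
    """
    counts = Counter(token.lower() for token in tokens)
    return {word for word, n in counts.items() if n >= min_freq}


def is_too_sparse(features):
    """Return True if no more than one feature is present (non-zero).

    Assumption: a feature vector is a sequence of numeric values.
    """
    return sum(1 for value in features if value) <= 1


# Toy token list (hypothetical data, not taken from WSJ).
tokens = ["The"] * 3 + ["the"] * 2 + ["Market"] * 4 + ["rose"] * 5
kept = frequent_types(tokens)
```

In this toy input, "the" reaches the threshold only because its capitalized and lowercase occurrences are counted together, while "market" falls one occurrence short and is dropped.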