I’m currently investigating “part of speech” taggers (commonly referred to as simply POS taggers) and came across a relatively new Java API for POS tagging developed by the Stanford Natural Language Processing Group. What I really like about this API is its simplicity - other POS tagger APIs have not been as simple. However, it is understandable that many POS tagger APIs are complex because POS tagging, while relatively simple for us humans, is not easy for software.
Here’s why Wikipedia says this problem is not easy:
Part-of-speech tagging is harder than just having a list of words and their parts of speech, because some words can represent more than one part of speech at different times. This is not rare -- in natural languages (as opposed to many artificial languages), a huge percentage of word-forms are ambiguous. For example, even "dogs", which is usually thought of as just a plural noun, can also be a verb:
The sailor dogs the hatch.
"Dogged", on the other hand, can be either an adjective or a past-tense verb. Just which parts of speech a word can represent varies greatly.
In order to pick a POS tagger that works best for my current needs, I plan on outlining each tagger API and scoring each API across a range of attributes. However, as a result of my research up to this point, I’ve learned it’s difficult to visualize how well POS taggers actually perform. Because of this difficulty, I’ve created a simple demo application that will load Text or RTF documents and allow the parts of speech in an opened document to be clearly highlighted.
Here is a screenshot of the demo application with all nouns for a document highlighted:

To try it out, feel free to download the demo HERE. (Don’t worry, it’s easy – I created an installer using NSIS to make it easy to share.)
The demo uses the Stanford Natural Language Processing Group’s POS Tagger (see http://nlp.stanford.edu/software/tagger.shtml) to perform the actual POS tagging. I plan on adding other POS taggers to the demo in order to evaluate 1) ease of integration and 2) effectiveness at POS tagging. So, in other words, stay tuned!
(In case you’re wondering, the demo was developed using SWT & Eclipse RCP; the source is available HERE.)
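To make the highlighting idea a bit more concrete, here is a minimal sketch of how tagged text could be turned into a list of nouns to highlight. This is not the demo's actual code. It assumes the tagger's output is available as plain text in a `word_TAG` form with Penn Treebank tags, and the sample string below is hand-written rather than real tagger output. The class name `TaggedNounFinder` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: picks the nouns out of text tagged in an assumed "word_TAG" format.
class TaggedNounFinder {
    // Returns the words whose tag starts with "NN" (NN, NNS, NNP, NNPS in the Penn Treebank set).
    static List<String> nouns(String taggedText) {
        List<String> result = new ArrayList<>();
        for (String token : taggedText.trim().split("\\s+")) {
            // Split on the last underscore so words containing '_' keep their full text.
            int separator = token.lastIndexOf('_');
            if (separator <= 0) {
                continue; // skip tokens that do not have the word_TAG shape
            }
            String word = token.substring(0, separator);
            String tag = token.substring(separator + 1);
            if (tag.startsWith("NN")) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hand-written sample input, assumed format only.
        String tagged = "The_DT sailor_NN dogs_VBZ the_DT hatch_NN ._.";
        System.out.println(nouns(tagged));
    }
}
```

A UI could then map each returned word back to its position in the document and apply a highlight style there. The same approach would work for verbs or adjectives by changing the tag prefix.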
| Attachment | Size |
|---|---|
| pos-tagger-screenshot.jpg | 79.2 KB |
Comments
Hi:
I am trying to use this within a JSP and servlet setup, with Tomcat serving the pages. We have not been successful so far. I was wondering whether you have tried this approach, and we would appreciate any pointers. Thanks.
-Vasu
Dear Mr. Vanhook,
My name is Sebastian Gallese and I am a student at Brown University in Providence, RI. I think you might be interested in an art and technology project I am helping build. It is called Osiris, a music visualizer that uses the lyrics of a song and images from the Internet.
Our group members are intrigued by your work on this part-of-speech (POS) tagger. Although we understand you have a busy schedule, we would love just to talk to you about our project and hear whatever comments you may have.
Thank you,
Sebastian Gallese
Someone inquired about re-using the source or binary form of this demo. To make it perfectly clear: knock yourself out! This demo was intended to demonstrate the capabilities of, and integration with, the Stanford POS Tagger. So feel free to re-use the source or demo in any way you like.