I’ve taken a dive into Natural Language Processing over the last few weeks. My initial requirement was to extract temporal information from a bunch of related feeds for a side project of mine. I did some research online and bought the excellent Natural Language Processing with Python.
The book is about NLP in general, but the examples are very focused on NLTK, an open source toolkit for NLP in Python. It’s a really interesting tool and quite easy (especially with the book) to get into. If you have any interest in NLP and are reasonably capable with Python, this is a great place to start.
I hope to make more use of NLP in my project, but to begin with here is how I dealt with the initial, relatively straightforward, requirement.
With a little googling I discovered a contrib module called timex. Timex has two functions: tag() identifies temporal expressions in a text and wraps them in a TIMEX2 tag, and ground() fills in the val attribute of those tags with the value the temporal expression represents.
>>> from nltk_contrib.timex import *
>>> content = "Belfast up and coming band Cashier No.9 are playing a free gig in the Mercantile this Saturday night at 9.30pm. The band are set to release their debut album To The Death of Fun, which is mixed by David Holmes, in March."
>>> tag(content)
'Belfast up and coming band Cashier No.9 are playing a free gig in the Mercantile <TIMEX2>this Saturday</TIMEX2> night at 9.30pm. The band are set to release their debut album To The Death of Fun, which is mixed by David Holmes, in March.'
As you can see from the example, timex neatly tags relative temporal expressions like “this Saturday”. This is essential when trying to parse feeds, as people usually talk about time relatively rather than specifying an absolute date.
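To make the idea of tagging concrete, here is a minimal sketch of how a relative expression could be wrapped in a tag using a regular expression. It is not timex's actual implementation: the name tag_relative and the pattern below are my own assumptions for illustration, and the pattern only covers "this/next/last" followed by a weekday.

```python
import re

# Hypothetical pattern, not taken from timex: "this/next/last <weekday>".
WEEKDAY = r"(?:monday|tuesday|wednesday|thursday|friday|saturday|sunday)"
RELATIVE = re.compile(r"\b(?:this|next|last)\s+" + WEEKDAY + r"\b", re.IGNORECASE)

def tag_relative(text):
    """Wrap every match of RELATIVE in <TIMEX2>...</TIMEX2>."""
    return RELATIVE.sub(lambda m: "<TIMEX2>" + m.group(0) + "</TIMEX2>", text)

print(tag_relative("playing a free gig in the Mercantile this Saturday night"))
# playing a free gig in the Mercantile <TIMEX2>this Saturday</TIMEX2> night
```

A real tagger needs many more patterns than this, which is exactly what a module like timex saves you from writing.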
The ground method can then be used to convert this relative temporal expression into an absolute date like so:
>>> content = tag(content)
>>> print gmt()
2011-01-21 11:03:49.67
>>> ground(content, gmt())
'Belfast up and coming band Cashier No.9 are playing a free gig in the Mercantile <TIMEX2 val="2011-01-22">this Saturday</TIMEX2> night at 9.30pm. The band are set to release their debut album To The Death of Fun, which is mixed by David Holmes, in March.'
The second parameter to ground() is required in order to calculate the date. In this case I have just specified the current date/time (2011-01-21 11:03:49.67), but in the case of feeds the published date of the entry is probably most appropriate.
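To show why the base date matters, here is a rough sketch of grounding "this <weekday>" against a base date using only the standard library. The function ground_this_weekday is hypothetical, and it assumes "this Saturday" means the Saturday in the same Monday-to-Sunday week as the base date; timex's own rules may well differ.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

def ground_this_weekday(expression, base_date):
    """Resolve 'this <weekday>' to a date in the same week as base_date.

    Assumption: weeks run Monday to Sunday and 'this' never leaves that week.
    """
    name = expression.lower().split()[-1]
    target = WEEKDAYS.index(name)  # Monday is 0, as in date.weekday()
    monday = base_date - timedelta(days=base_date.weekday())
    return monday + timedelta(days=target)

base = date(2011, 1, 21)  # the date used in the example above, a Friday
print(ground_this_weekday("this Saturday", base).isoformat())
# 2011-01-22
```

Passing a different base date, such as the published date of a feed entry, gives a different answer for the same text: with a base of date(2011, 1, 10) the same expression resolves to 2011-01-15.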
This is a great leap forward but I’m not totally happy with the results. There are many other common formats used in the feeds that aren’t handled well by timex. For example:
>>> tag("13th January")
>>> tag("this January")
>>> tag("2011-01-21 11:03:49.67")
>>> tag("9:00pm 13/1/2011")
I think a lot of these can be improved on relatively easily, so in the next post I should have some suggestions for how to improve on timex.
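To give a flavour of the kind of pattern involved, here is a rough sketch that recognises two of the formats above, "13th January" and an ISO-style date. None of this is part of timex: find_dates and both patterns are assumptions of mine, and they only find the expressions without grounding them.

```python
import re

# Hypothetical patterns, not taken from timex.
MONTH = (r"(?:january|february|march|april|may|june|july|august|"
         r"september|october|november|december)")
# A day number with an optional ordinal suffix, followed by a month name.
DAY_MONTH = re.compile(r"\b(\d{1,2})(?:st|nd|rd|th)?\s+(" + MONTH + r")\b",
                       re.IGNORECASE)
# A date written as YYYY-MM-DD; any time that follows is ignored.
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def find_dates(text):
    """Return the substrings of text matched by either pattern."""
    found = [m.group(0) for m in DAY_MONTH.finditer(text)]
    found += [m.group(0) for m in ISO_DATE.finditer(text)]
    return found

print(find_dates("13th January"))            # ['13th January']
print(find_dates("2011-01-21 11:03:49.67"))  # ['2011-01-21']
```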