Temporal Expressions

I’ve taken a dive into Natural Language Processing over the last few weeks. My initial requirement was to extract temporal information from a bunch of related feeds for a side project of mine. I did some research online and bought the excellent Natural Language Processing with Python.

The book is about NLP in general, but the examples are very focused on NLTK, an open source toolkit for NLP in Python. It’s a really interesting tool and quite easy (especially with the book) to get into. If you have any interest in NLP and are reasonably capable with Python, this is a great place to start.

I hope to make more use of NLP in my project, but to begin with here is how I dealt with the initial, relatively straightforward, requirement.

With a little googling I discovered a contrib module called timex. Timex has two functions: tag() identifies temporal expressions in a text and wraps them in a TimeML-style TIMEX2 tag, and ground() fills the val attribute of those tags with the value represented by the temporal expression.

>>> from nltk_contrib.timex import *

>>> content = "Belfast up and coming band Cashier No.9 are playing a free gig in the Mercantile this Saturday night at 9.30pm. The band are set to release their debut album To The Death of Fun, which is mixed by David Holmes, in March."

>>> tag(content)
'Belfast up and coming band Cashier No.9 are playing a free gig in the Mercantile <TIMEX2>this Saturday</TIMEX2> night at 9.30pm. The band are set to release their debut album To The Death of Fun, which is mixed by David Holmes, in March.'

As you can see from the example, timex neatly tags relative temporal expressions like “this Saturday”. This is essential when parsing feeds, as people usually talk about time relatively rather than specifying an absolute date.
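Since the tagged result is just a string, the expressions still need to be pulled out of it. Here is a minimal sketch of one way to do that with a regular expression. The names TIMEX2_RE and extract_timex are my own and not part of timex, and the pattern assumes tags are not nested and that val, when present, is the only attribute.

```python
import re

# Matches a TIMEX2 element, with or without a val attribute.
# Assumption: tags are not nested and val is the only attribute.
TIMEX2_RE = re.compile(r'<TIMEX2(?: val="([^"]*)")?>(.*?)</TIMEX2>', re.DOTALL)

def extract_timex(tagged_text):
    """Return (expression, val) pairs; val is None when the tag has no val."""
    return [(m.group(2), m.group(1)) for m in TIMEX2_RE.finditer(tagged_text)]

tagged = ("playing a free gig in the Mercantile "
          "<TIMEX2>this Saturday</TIMEX2> night at 9.30pm.")
print(extract_timex(tagged))
```

The same helper should work on text whose tags already carry a val attribute, in which case the second element of each pair is the date string instead of None.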

The ground() function can then be used to convert this relative temporal expression into an absolute date, like so:

>>> content = tag(content)

>>> print gmt()
2011-01-21 11:03:49.67

>>> ground(content,gmt())

'Belfast up and coming band Cashier No.9 are playing a free gig in the Mercantile <TIMEX2 val="2011-01-22">this Saturday</TIMEX2> night at 9.30pm. The band are set to release their debut album To The Death of Fun, which is mixed by David Holmes, in March.'

The second parameter is required in order to calculate the date: it is the reference point that relative expressions are resolved against. In this case I have just specified the current date/time (2011-01-21 11:03:49.67), but in the case of feeds the published date of the entry is probably most appropriate.
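To make the role of the base date concrete, here is a small standalone sketch of resolving “this <weekday>” against a reference date using only the standard library. This is not how timex implements ground(). The function name resolve_this_weekday is hypothetical, and the rule it applies (the first matching weekday on or after the base date) is an assumption of mine.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

def resolve_this_weekday(expression, base_date):
    """Resolve 'this <weekday>' to the first such day on or after base_date.

    Returns an ISO date string, or None if the expression is not recognised.
    """
    words = expression.lower().split()
    if len(words) != 2 or words[0] != "this" or words[1] not in WEEKDAYS:
        return None
    target = WEEKDAYS.index(words[1])  # Monday is 0, as in date.weekday()
    days_ahead = (target - base_date.weekday()) % 7
    return (base_date + timedelta(days=days_ahead)).isoformat()

# Hypothetical published date of a feed entry (a Friday).
published = date(2011, 1, 21)
print(resolve_this_weekday("this Saturday", published))          # 2011-01-22

# The same phrase in an entry published a week later gives a different date.
print(resolve_this_weekday("this Saturday", date(2011, 1, 28)))  # 2011-01-29
```

The second call shows why the published date matters: grounding an old entry against today's date would give the wrong Saturday.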

This is a great leap forward but I’m not totally happy with the results. There are many other common formats used in the feeds that aren’t handled well by timex. For example:

>>> tag("january")
'january'
>>> tag("January")
'January'
>>> tag("13th January")
'13th January'
>>> tag("this January")
'<TIMEX2>this January</TIMEX2>'
>>> tag("2011-01-21 11:03:49.67")
'<TIMEX2><TIMEX2>2011</TIMEX2>-01-21 11:03:49.67</TIMEX2>'
>>> tag("2011-01-21")
'<TIMEX2>2011</TIMEX2>-01-21'
>>> tag("13/1/2011")
'13/1/2011'
>>> tag("9:00pm 13/1/2011")
'9:00pm 13/1/2011'
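As a rough illustration of what covering a few of these formats might involve, here is a standalone sketch that tags numeric dates and day-plus-month expressions in a single pass. None of it comes from timex: EXTRA_PATTERNS and tag_extra are hypothetical names, and the patterns are deliberately narrow. Bare month names are left out because words like “may” and “march” are ambiguous without context.

```python
import re

MONTHS = ("january|february|march|april|may|june|july|august|"
          "september|october|november|december")

# Hypothetical extra patterns; these are not taken from timex itself.
EXTRA_PATTERNS = re.compile(
    r"\b(?:"
    r"\d{1,2}/\d{1,2}/\d{4}"                           # 13/1/2011
    r"|\d{4}-\d{2}-\d{2}"                              # 2011-01-21
    r"|\d{1,2}(?:st|nd|rd|th)?\s+(?:" + MONTHS + r")"  # 13th January
    r")\b",
    re.IGNORECASE,
)

def tag_extra(text):
    """Wrap each match in a TIMEX2 tag in one pass, so matches cannot nest."""
    return EXTRA_PATTERNS.sub(
        lambda m: "<TIMEX2>" + m.group(0) + "</TIMEX2>", text)

for sample in ["13th January", "2011-01-21", "9:00pm 13/1/2011"]:
    print(tag_extra(sample))
```

Doing the substitution in one pass with a single combined pattern avoids a tag being wrapped inside another tag, which is what happens when a year is matched separately inside a longer date.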

I think a lot of these can be improved on relatively easily, so in the next post I should have some suggestions for how to improve on timex.

About bebblebrox

I am an experienced software developer with over five years’ commercial experience and strong software design and development skills. I am searching for an ambitious start-up company with a shared passion for delivering quality software products.