08.05.08

Finding relevant keywords

Posted in NPR API at 11:31 pm by Jason

One thing that NPR doesn’t provide is contextual keywords for its stories. Stories can be assigned to topics, series, columns, and programs. Certain music stories are assigned to artist and genre pages. That doesn’t really give you an idea of what the story is about. The title, teaser, and miniteaser each provide a summary in their own way, but they don’t lend themselves to grouping related stories. Search is only useful once you have an idea of what you’re looking for; it doesn’t lend itself to finding interesting new content or following trends as they rise and fall.

I’ve been thinking about how to go about automatically figuring out the keywords from the text. I’m trying to decide between writing my own keyword engine and finding an API to do it for me.

The benefit to using a third party API is obvious. It means I don’t have to reinvent the wheel and I can make use of other people’s expertise. I admit that I don’t know all the science behind this kind of text parsing, so my implementation would definitely be bare bones. On the other hand, I have an advantage that a third party app wouldn’t have. I know that I’m parsing a news story and I even know what topics the editors assigned the story to. Having that information can only improve the results. Plus, I’d probably learn a lot by trying to do it myself.

Doing a basic Google search, I find a few services to do what I’m looking for. Most of them are geared towards linking to ads and SEO. This is an area with a lot of research into it, but I’m not necessarily looking for something that would have satisfied my computer science profs, just something good enough to be interesting. Maybe I should pull out my old algorithms book, the one that’s been gathering dust for 10+ years.

Let’s try an example, and see where we can go with it.

It’s assigned to the Sports and Nation topics, as well as the Beijing Olympics 2008 and Profiles: Bound For Beijing series. None of those would link to stories from the Olympics in previous years, nor do they tell us that the sport is weightlifting. And it might be nice to be able to automatically look for other information about Melanie Roach, not just on NPR, but using other APIs. Besides the teaser that’s shown above, there’s a lot of text in the story that could be analyzed, from the full length story to the image captions.

I’d be curious to see what a third party service would find versus just pulling out all the capitalized phrases and figuring out which uncommon words are used most often.
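Here’s a rough sketch of what that do-it-yourself approach might look like in Python. The function names, the tiny stopword list, and the sample sentence are all made up for illustration; a real version would need a much longer list of common words and smarter handling of sentence-initial capitals.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); a real one would be far longer.
STOPWORDS = {
    "the", "a", "an", "and", "or", "but", "of", "to", "in", "on", "at", "for",
    "with", "by", "from", "as", "is", "was", "are", "were", "be", "been",
    "it", "its", "this", "that", "she", "he", "her", "his", "they", "their",
    "has", "have", "had", "not", "will", "who", "which", "after", "before",
}

def capitalized_phrases(text):
    """Return every run of two or more consecutive capitalized words."""
    return re.findall(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b", text)

def uncommon_word_counts(text):
    """Count lowercase words that are not stopwords and are longer than two letters."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS and len(w) > 2)

# Made-up sample text, not taken from any NPR story.
sample = ("Jane Doe trains for weightlifting. "
          "Weightlifting takes patience, and Jane Doe has plenty.")

print(capitalized_phrases(sample))
print(uncommon_word_counts(sample).most_common(3))
```

Single capitalized words are skipped on purpose, since the first word of every sentence would otherwise show up as a candidate.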

One idea that I have is to take a look at the stories for a particular day and be able to figure out what NPR thought was important about that day. I think that would be an interesting way to browse the news and watch the ebb and flow.
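Assuming each story already has a list of keywords, the per-day view could be as simple as counting how many stories mention each keyword. This is only a sketch: the (date, keywords) tuple shape and the sample data are hypothetical, not anything the NPR API returns.

```python
from collections import Counter

def keywords_for_day(stories, day):
    """Rank keywords by how many of that day's stories mention them.

    `stories` is an iterable of (date_string, keyword_list) pairs,
    a made-up shape used only for this example.
    """
    counts = Counter()
    for story_date, keywords in stories:
        if story_date == day:
            # set() so a keyword counts once per story, however often it repeats
            counts.update(set(keywords))
    return counts.most_common()

# Hypothetical data for illustration.
stories = [
    ("2008-08-05", ["olympics", "weightlifting", "beijing"]),
    ("2008-08-05", ["olympics", "beijing", "olympics"]),
    ("2008-08-05", ["olympics", "economy"]),
    ("2008-08-04", ["economy"]),
]

print(keywords_for_day(stories, "2008-08-05"))
```

Counting stories rather than raw word occurrences keeps one very long story from dominating the day.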

The other idea that I’m thinking about is being able to generate a tag cloud for a particular time frame. You could not only figure out what the keywords are for the stories but also keep track of which ones were used more, and then display them in a visually interesting manner.
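For the display side, one simple option is to scale each keyword’s font size linearly between a minimum and a maximum based on how often it was used. The sketch below assumes the counts for the time frame have already been collected; the function name, the default sizes, and the counts are invented for illustration.

```python
def tag_sizes(counts, min_size=10, max_size=32):
    """Map keyword counts to font sizes by linear scaling.

    The least-used keyword gets min_size and the most-used gets max_size.
    If every keyword has the same count, they all get the midpoint.
    """
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    span = hi - lo
    sizes = {}
    for word, n in counts.items():
        scale = (n - lo) / span if span else 0.5
        sizes[word] = round(min_size + scale * (max_size - min_size))
    return sizes

# Made-up counts for illustration.
print(tag_sizes({"olympics": 9, "beijing": 5, "weightlifting": 1}))
```

Linear scaling is the easiest thing to try first; if a few keywords dwarf the rest, scaling on the logarithm of the count might give a more readable cloud.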
