by jondot 4974 days ago
Sorry, this isn't rocket science at all.

Standard clustering algorithms (found in any off-the-shelf natural language processing library) and text summarization with libots should suffice for most of the heavy lifting.

http://tldr.it/ http://libots.sourceforge.net/
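To make the "heavy lifting" concrete, here is a minimal sketch of frequency-based extractive summarization, the general family of technique that off-the-shelf summarizers tend to build on. It is not libots's API: the function names, the tiny stopword list, and the naive sentence splitter are all assumptions made up for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real libraries ship larger ones).
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "it",
             "that", "on", "for", "was", "with", "as", "by", "at"}

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def content_words(text):
    # Lowercased word tokens with stopwords removed.
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def summarize(text, max_sentences=2):
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average word frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    # Pick the top-scoring sentences, then restore document order.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:max_sentences])]
```

Calling `summarize(article_text, max_sentences=3)` returns up to three sentences copied verbatim from the input, which is exactly why this kind of approach is extraction and not abstraction.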

Further, in most news articles the first paragraph is a practical summary (although you may not have noticed).
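That observation gives you a trivial baseline: just return the lead paragraph. A rough sketch follows, assuming the article body is plain text with paragraphs separated by blank lines; `lead_summary` and its `max_sentences` parameter are hypothetical names for illustration.

```python
import re

def lead_summary(article, max_sentences=None):
    # Paragraphs are assumed to be separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", article) if p.strip()]
    if not paragraphs:
        return ""
    # Collapse line breaks inside the first paragraph into single spaces.
    lead = " ".join(paragraphs[0].split())
    if max_sentences is None:
        return lead
    # Naive sentence split; abbreviations like "Co." will confuse it.
    sentences = re.split(r"(?<=[.!?])\s+", lead)
    return " ".join(sentences[:max_sentences])
```

Any fancier summarizer should at least beat this on news text before it is worth the extra work.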

Coming from NLP: unless you can influence the source, and with the source being the Web, this is an 80%-20% story in the best case. You'll work VERY hard to correct the remaining 20%, and you WILL be left with a percentage of content you just can't summarize properly.

What would make a difference is real people-driven summarization, not machines (see what voicebunny did for text-to-speech, for example). And yes, it would be fun to combine the two as well.


I came across an article on The Verge whose content is mainly a video.

What would amaze me is a good automatic summarization algorithm that uses abstraction and not just extraction.

Also, check out circa (http://cir.ca/). I've never tried it, but from what I've read, it uses both humans and algorithms to "summarize" articles.

circa is a good idea, actually. From my close experience with this field, when a news article is published it gets edited and republished many times, in many forms and shapes (Web, RSS, etc.). Many of these steps need manual, human work, and this affects the volume of published news.

Further, much of the news really originates from a relatively limited set of sources (Reuters, etc.), so you can plug your solution in there as well.

Therefore it should be OK to assume that if you put humans in the same pipeline to summarize news manually, the capacity and efficiency will be reasonable.

The problem with summarizing news manually is that it takes too much effort for a human. The efficiency may be good at first, but as more and more news comes in, it will go down (assuming there is only one person summarizing).

True, but my point is that people are already doing it at the start of the pipeline. Think what happens when Reuters decides to make a SaaS offering of their summarized content. Even regardless of that, you can hire a battery of professional summarizers instead of PhDs and do it pretty well.

Where this doesn't apply, and where I do think you're completely right, is non-news articles: think blogs, tweets (although there's not much to summarize in 140 chars), product descriptions, scientific articles, etc. These things are produced in much higher volume, with much less workflow around them.