by kurumo 4803 days ago
Thanks, that's somewhat helpful. I am not particularly interested in the summarizer plugin itself (mostly because we have one, built in house), but I would love to talk about the underlying pipeline. If you have e.g. a named entity recognition library that performs as well as you say in Romance languages on standard data sets, you have material for at least one conference paper, and furthermore a product much more valuable than the summarizer itself.

My question about speed referred to syntactic parsing specifically. I am sure you can do entropy scoring faster than 200ms per sentence, but unless you have access to parses you are unlikely to be able to do more than purely extractive summarization. That's what Summly does, and every other summarizer on the planet as well. (Except perhaps Columbia's Newsblaster, but that's a bit of a different story).

1 comments

We do extractive summarization because we don't feel that changing the author's words is fair use. We could do rewriting. We actually have an in-house demo that, for lack of a better word, builds Wikipedia pages for animals. (Animals have fixed traits, and the information on them changes much less frequently, so it is easier than if we were to try to do people in general.)

I don't have time to do conference papers.

Our pipeline requires almost every one of our capabilities in order to do TLDR.

We have to grab the page. We have to separate the content from the theme. We have to convert the HTML to a non-HTML "thing" that lets us work on the text while preserving the HTML. Then we have to disambiguate and segment the sentences. Then we have to analyze the type of content to pick how we are going to summarize it, which requires all the noun, stemming, and keyword analysis. Then we have to rank the sentences by importance based on concepts, causation, readability, and emotion. Then we have to put all the HTML back and present it to the user.
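
To make the segment-then-rank part concrete, here is a rough sketch of a purely extractive summarizer. It is not our actual pipeline: the regex sentence splitter, the tiny stopword list, and the word-frequency score are all stand-in assumptions for illustration, and the function names are made up.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that", "we", "on", "for"}

def split_sentences(text):
    # Naive segmentation: split on whitespace that follows ., ! or ?.
    # A real segmenter has to handle abbreviations, quotes, and so on.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def score_sentences(sentences):
    # Score each sentence by the average document-wide frequency of its
    # non-stopword tokens (a stand-in for concept/readability scoring).
    tokens = [
        [w for w in re.findall(r"[a-z']+", s.lower()) if w not in STOPWORDS]
        for s in sentences
    ]
    freq = Counter(w for words in tokens for w in words)
    return [sum(freq[w] for w in words) / len(words) if words else 0.0 for words in tokens]

def summarize(text, n=2):
    # Pick the n highest-scoring sentences and return them verbatim,
    # in their original order, so the author's words are never changed.
    sentences = split_sentences(text)
    scores = score_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]
```

Calling `summarize(page_text, n=3)` returns three sentences copied unchanged from the input, which is the whole point of staying extractive.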

We set the goal that Tom Sawyer can't take more than 45 seconds to run.

Fair use or not, if you could do it I would buy it :) Fine, forget conference papers. If you can demonstrate fast NER in multiple languages, across domains, with competitive precision/recall metrics, I will buy it. The rest of it is not particularly interesting to me because it's frankly not that hard.
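
For what it's worth, by precision/recall I mean the usual entity-level scoring. A minimal sketch, assuming exact-match scoring where each entity is a hypothetical `(start, end, label)` tuple (standard evaluation scripts differ in details):

```python
def precision_recall_f1(gold, predicted):
    # An entity counts as correct only if span and label both match exactly.
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Run that per language and per domain on held-out annotated text and you have the numbers I would want to see.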