by kurumo
4803 days ago
Thanks, that's somewhat helpful. I am not particularly interested in the summarizer plugin itself (mostly because we have one, built in-house), but I would love to talk about the underlying pipeline. If you have, e.g., a named entity recognition library that performs as well as you say in Romance languages on standard data sets, you have material for at least one conference paper, and furthermore a product much more valuable than the summarizer itself. My question about speed referred to syntactic parsing specifically. I am sure you can do entropy scoring faster than 200 ms per sentence, but unless you have access to parses you are unlikely to be able to do more than purely extractive summarization. That's what Summly does, and every other summarizer on the planet as well. (Except perhaps Columbia's Newsblaster, but that's a bit of a different story.)
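To make "entropy scoring" and "purely extractive" concrete, here is a minimal sketch of one common reading of the idea: score each sentence by the average surprisal (in bits) of its words under a unigram model of the document, then copy the top-scoring sentences out verbatim. This is an assumption for illustration, not a description of Summly or any other product mentioned here; the function names and the naive regex sentence splitter are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; deliberately crude."""
    return re.findall(r"[a-z']+", text.lower())


def split_sentences(text):
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def surprisal_scores(sentences):
    """Average -log2 p(word) per sentence, with p estimated from the document itself."""
    counts = Counter(w for s in sentences for w in tokenize(s))
    total = sum(counts.values())
    scores = []
    for s in sentences:
        words = tokenize(s)
        if not words:
            scores.append(0.0)
            continue
        bits = sum(-math.log2(counts[w] / total) for w in words)
        scores.append(bits / len(words))
    return scores


def extractive_summary(text, k=2):
    """Return the k highest-scoring sentences, unchanged, in document order."""
    sentences = split_sentences(text)
    scores = surprisal_scores(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Nothing here looks at syntax: the output can only ever be sentences that already exist in the input, which is the limitation the comment is pointing at.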
I don't have time to do conference papers.
Our pipeline requires almost every one of our capabilities in order to do TLDR.
We have to grab the page. We have to separate the content from the theme. We have to convert the HTML to a non-HTML "thing" that lets us work on the text while preserving the HTML. Then we have to disambiguate and segment the sentences. Then we have to analyze the type of content to pick how we are going to summarize it, which requires all the noun, stemming, and keyword analysis. Then we have to rank the sentences by importance based on concepts, causation, readability, and emotion. Then we have to put all the HTML back and present it to the user.
We set the goal that Tom Sawyer can't take more than 45 seconds to run.
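The stages listed above can be sketched roughly as follows. This is a toy under stated assumptions, not the actual pipeline: it skips fetching the page and the content-type analysis, treats only `<p>` text as "content" and everything else as "theme", and ranks by a placeholder word-frequency score rather than concepts, causation, readability, or emotion. All names (`ParagraphExtractor`, `extract_content`, `segment`, `rank`, `tldr`) are hypothetical.

```python
import re
from collections import Counter
from html import escape
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text inside <p> tags; everything else is treated as theme."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._in_p = False
            text = " ".join("".join(self._buf).split())  # collapse whitespace
            if text:
                self.paragraphs.append(text)

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)


def extract_content(html):
    """Separate content from theme (assumption: content lives in <p> tags)."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


def segment(paragraphs):
    """Naive sentence segmentation on ., ! or ? followed by whitespace."""
    sentences = []
    for p in paragraphs:
        sentences.extend(s for s in re.split(r"(?<=[.!?])\s+", p) if s)
    return sentences


def _words(sentence):
    return re.findall(r"[a-z]+", sentence.lower())


def rank(sentences):
    """Placeholder ranking: mean document frequency of a sentence's words."""
    freq = Counter(w for s in sentences for w in _words(s))

    def score(s):
        ws = _words(s)
        return sum(freq[w] for w in ws) / len(ws) if ws else 0.0

    return sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)


def tldr(html, k=1):
    """Run the stages end to end and put minimal HTML back around the result."""
    sentences = segment(extract_content(html))
    keep = sorted(rank(sentences)[:k])  # restore document order
    return "".join(f"<p>{escape(sentences[i])}</p>" for i in keep)
```

Calling `tldr(page_html, k=3)` on a page's markup returns up to three sentences wrapped in `<p>` tags. A real implementation would have to keep the original markup around each sentence rather than regenerating it, which is what the non-HTML intermediate form described above is for.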