by drakaal 4803 days ago
We do extractive summarization because we don't feel that changing the author's words is fair use. We could do rewriting. We actually have an in-house demo that, for lack of a better word, builds Wikipedia pages for animals. (Animals have fixed traits, and the information on them changes much less frequently, so it is easier than trying to do the same for people in general.)

I don't have time to do conference papers.

Our pipeline requires almost every one of our capabilities in order to do TLDR.

We have to grab the page. We have to separate the content from the theme. We have to convert the HTML to a non-HTML "thing" that lets us work on the text but maintain the HTML. Then we have to disambiguate and segment the sentences. Then we have to analyze the type of content to pick how we are going to summarize it, which requires all the noun, stemming, and keyword analysis. Then we have to rank the sentences by importance based on concepts, causation, readability, and emotion. Then we have to put all the HTML back and present it to the user.
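
For readers who want something concrete, here is a minimal sketch of just the segment, rank, and extract steps. It is not the pipeline described above: the regex sentence splitter, the tiny `STOPWORDS` set, and the word-frequency score are all stand-in assumptions, and the frequency score in particular replaces the concept, causation, readability, and emotion ranking with something far cruder. The function names are hypothetical. The one property it does share is that it is extractive: sentences come back verbatim and in their original order.

```python
import re
from collections import Counter

# Assumption: a toy stopword list; a real system would use a much larger one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
             "that", "we", "have", "on", "for"}

def split_sentences(text):
    # Naive segmentation: break after ., ! or ? followed by whitespace.
    # This does no disambiguation (e.g. "Dr." will be split incorrectly).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Lowercase words, minus stopwords. No stemming is done here.
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if w not in STOPWORDS]

def summarize(text, max_sentences=2):
    sentences = split_sentences(text)
    # Document-wide word frequencies stand in for "importance".
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score; ties keep document order (stable sort).
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    # Restore document order so the extract reads naturally.
    keep = sorted(ranked[:max_sentences])
    return [sentences[i] for i in keep]

if __name__ == "__main__":
    sample = "Cats sleep a lot. Cats hunt mice. Dogs bark."
    for line in summarize(sample, max_sentences=2):
        print(line)
```

Because nothing is rewritten, every returned string is an exact sentence from the input, which is the point of the fair-use argument above.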

We set the goal that Tom Sawyer can't take more than 45 seconds to run.

1 comment

Fair use or not, if you could do it I would buy it :) Fine, forget conference papers. If you can demonstrate fast NER in multiple languages, across domains, with competitive precision/recall metrics, I will buy it. The rest of it is not particularly interesting to me because it's frankly not that hard.