Hacker News
by drakaal 4807 days ago
Install the TLDR plugin. Pick a web site. Or better yet, go to Project Gutenberg and pick a book. Tom Sawyer. Push TLDR. Way faster than 200ms per sentence.

Yes it does most Germanic and Romance languages.

Yes, it does domain-independent named entities with a higher score than anything else on the planet. ALL English classes. Medical, Dental, Animal (that doesn't include Latin uses of animal names). Technical.

As I said, we are just stepping out of stealth. I linked a PDF in the comments here.

3 comments

Thanks, that's somewhat helpful. I am not particularly interested in the summarizer plugin itself (mostly because we have one, built in house), but I would love to talk about the underlying pipeline. If you have e.g. a named entity recognition library that performs as well as you say in Romance languages on standard data sets, you have material for at least one conference paper, and furthermore a product much more valuable than the summarizer itself.

My question about speed referred to syntactic parsing specifically. I am sure you can do entropy scoring faster than 200ms per sentence, but unless you have access to parses you are unlikely to be able to do more than purely extractive summarization. That's what Summly does, and every other summarizer on the planet as well. (Except perhaps Columbia's Newsblaster, but that's a bit of a different story).

We do extractive summarization because we don't feel that changing the author's words is fair use. We could do rewriting. We actually have an in-house demo that, for lack of a better word, builds Wikipedia pages for animals. (Animals have fixed traits, and the information on them changes much less frequently, so it is easier than trying to do people in general.)

I don't have time to do conference papers.

Our pipeline requires almost every one of our capabilities in order to do TLDR.

We have to grab the page. We have to separate the content from the theme. We have to convert the HTML to a non-HTML "thing" that lets us work on the text but maintain the HTML. Then we have to disambiguate and segment the sentences. Then we have to analyze the type of content to pick how we are going to summarize it, which requires all the noun, stemming, and keyword analysis. Then we have to rank the sentences by importance based on concepts, causation, readability, and emotion. Then we have to put all the HTML back and present it to the user.
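To make those stages concrete, here is a minimal Python sketch of a purely extractive pipeline: strip the markup, segment sentences, score them, and keep the top fraction in document order. It is not the TLDR implementation. The function names, the stopword list, and the word-frequency scoring are all assumptions chosen for illustration, standing in for the concept, causation, readability, and emotion ranking described above. It also skips theme separation and does not put the HTML back.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "it", "that"}


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def split_sentences(text):
    """Naive segmentation: split after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase words with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]


def summarize(html, ratio=0.25):
    """Keep the top `ratio` of sentences by average word frequency, in original order."""
    sentences = split_sentences(html_to_text(html))
    if not sentences:
        return []
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    keep = max(1, round(len(sentences) * ratio))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)[:keep]
    return [sentences[i] for i in sorted(ranked)]
```

Calling `summarize(page_html, ratio=0.25)` returns a list of original sentences, unchanged, which is what makes the approach extractive rather than a rewrite.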

We set the goal that Tom Sawyer can't take more than 45 seconds to run.

Fair use or not, if you could do it I would buy it :) Fine, forget conference papers. If you can demonstrate fast NER in multiple languages, across domains, with competitive precision/recall metrics, I will buy it. The rest of it is not particularly interesting to me because it's frankly not that hard.
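For reference, the precision/recall comparison asked for here is usually computed over exact-match entity spans. The sketch below assumes each entity is a hypothetical `(start, end, label)` tuple and that a prediction only counts when all three fields match a gold entity. Both the representation and the function name are illustrative, not taken from any particular evaluation toolkit.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall, and F1 over sets of (start, end, label) entities."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: three gold entities, two predictions, one exact match.
gold = {(0, 2, "PER"), (5, 6, "LOC"), (9, 10, "ORG")}
predicted = {(0, 2, "PER"), (5, 6, "ORG")}
p, r, f1 = precision_recall_f1(predicted, gold)
```

Note that the second prediction has the right span but the wrong label, so under exact matching it counts against both precision and recall.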
Clothing, Textiles... We did recently learn that I missed furniture. Apparently a curio cabinet is not something that I was getting... but we get chest of drawers just fine, and writing desk. We even get all the weird dogs.
I tried it on http://paulgraham.com/startupideas and here's what it gave me:

"How to Get Startup Ideas

[1] [2] [3] You want to know how to paint a perfect painting? It's easy. Make yourself perfect and then just paint naturally. Live in the future, then build what's missing. [4] [5] [6] [7] Live in the future and build what seems interesting. [8] [9] [10] 10 [11] 11 [12] 12 [13] 13 [14] 14 [15] 15 [16] 16 [17] 17"

doesn't seem to work at all...

Highlight the part you want to summarize. Like the part without the Notes.

Also Paul's writing is pretty poor. The ideas are good, but he jumps around and uses short sentences with far too many pronouns.

Garbage in, garbage out.

Here is the 25% version, which I think is readable:

The way to get startup ideas is not to try to think of startup ideas. And yet by far the most common mistake startups make is to solve problems no one has.

I made it myself. But galleries didn't want to be online. Because I didn't pay attention to users. Because they begin by trying to think of startup ideas. That m.o. is doubly dangerous: it doesn't merely yield few good ideas; it yields bad ideas that sound plausible enough to fool you into working on them.

At YC we call these "made-up" or "sitcom" startup ideas. But coming up with good startup ideas is hard.

For example, a social network for pet owners. Millions of people have pets. Choose the latter. Not all ideas of that type are good startup ideas, but nearly all good startup ideas are of that type.

Made-up startup ideas are usually of the first type.

Nearly all good startup ideas are of the second type. If you can't answer that, the idea is probably bad. But you almost always do get it.

But while demand shaped like a well is almost a necessary condition for a good startup idea, it's not a sufficient one. If Mark Zuckerberg had built something that could only ever have appealed to Harvard students, it would not have been a good startup idea. Facebook was a good idea because it started with a small market there was a fast path out of. So you spread rapidly through all the colleges. Often you can't.