

by jkldotio 4833 days ago

Nice. The output is similar to that of my RSS ingest pipeline for http://jkl.io. I've yet to add my custom document/topical hash, sentiment and topical classifiers directly, but it already produces the article, the stemmed article, the first sentence (which will evolve into a summary) and named entities, and it resolves URL redirects.
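For anyone curious what that kind of per-URL output might look like, here is a rough sketch using only the Python standard library. It is not the jkl.io code: every function name, the suffix list, the regexes and the sample text are assumptions made for illustration, and the stemmer and entity finder are deliberately crude stand-ins for what a real NLP library would do.

```python
import re
import urllib.request

# Hypothetical suffix list for a deliberately crude stemmer; a real pipeline
# would more likely use a Porter-style stemmer from an NLP library.
SUFFIXES = ("ing", "ed", "es", "s")


def resolve_redirects(url, timeout=10):
    """Follow HTTP redirects and return the final URL (needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.geturl()


def crude_stem(word):
    """Lowercase a word and strip one common suffix, keeping >= 3 characters."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered


def first_sentence(text):
    """Return the text up to the first '.', '!' or '?' followed by whitespace.

    Abbreviations such as "U.S. " will cut the sentence short.
    """
    stripped = text.strip()
    match = re.search(r"(.+?[.!?])(\s|$)", stripped, flags=re.DOTALL)
    return match.group(1) if match else stripped


def naive_entities(text):
    """Collect runs of capitalised words as stand-ins for named entities.

    Sentence-initial words show up as false positives; this is only a
    placeholder for a trained named-entity recogniser.
    """
    runs = re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", text)
    return list(dict.fromkeys(runs))  # de-duplicate, keep first-seen order


def analyse(text):
    """Bundle the extractors into one record for a single article."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return {
        "article": text,
        "stemmed": [crude_stem(token) for token in tokens],
        "first_sentence": first_sentence(text),
        "entities": naive_entities(text),
    }
```

Calling `analyse("Yahoo bought Summly. Some startups rely on Common Crawl data.")` on that made-up sample returns a dict with the four fields, and `resolve_redirects(url)` can be run separately on the source URL before fetching. Keeping each extractor as its own small function is one way to make it easy to bolt on more of them later.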

I am thinking I should clean up the code, add a few more extractors and release it soon as a URL analysis library (I was thinking "demands" would be a good name to pair with Python's "requests"). I would like to get entity disambiguation from Wikipedia into it first, though, as I think that is a vital feature. My funding pitch largely failed, so I will approach that somewhat more slowly, but the methodology and libraries for constructing reasonable entity disambiguation from topic modelling (rather than heaviest sub-graph approaches) are out there.

I recently saw an API on HN selling basically this type of extraction from URLs, but I think it's necessary (along with Common Crawl and other such things) for this base layer to be available for free so people can properly compete with Google. I think Google currently runs 200+ extractors and classifiers on every page, so they have a huge advantage over startups (and non-profits, which is my area of interest). Common Crawl can't help with that just by providing the raw data.

3 comments

sdoering 4832 days ago

As I am trying to learn some basics of automated text processing and categorization, I am always fond of experiments and ideas like yours.

The idea of releasing it as demands sounds great. I would love to hear from you when it is released.


aoroz 4833 days ago

I like your prototype. Works well and looks clean. I have also been trying to work on something like this.


dreadsword 4833 days ago

You should have outbid Yahoo for Summly to level up your first-sentence summaries!
