Show HN: ParseRSS gets full-text articles from a RSS feed

	Show HN: ParseRSS gets full-text articles from a RSS feed (parserss.com)
	21 points by Valid 4833 days ago

3 comments

jkldotio 4833 days ago

Nice. The output is similar to that of my RSS ingest pipeline for http://jkl.io. I have yet to add my custom document/topical hash, sentiment and topical classifiers directly, but it already has article, stemmed article, first sentence (which will evolve into a summary) and named entities, and it resolves URL redirects.
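As an illustration of the "first sentence" field mentioned above, here is a minimal sketch using only the Python standard library. It is not the code behind jkl.io or ParseRSS; the function name `first_sentence` and the punctuation rule are assumptions made for this example, and the rule will split incorrectly on abbreviations such as "Dr. Smith".

```python
import re

# Naive boundary: sentence-ending punctuation, then whitespace, then an
# uppercase letter or an opening quote. This is an illustrative assumption,
# not a full sentence tokenizer.
_SENTENCE_END = re.compile(r'(?<=[.!?])\s+(?=[A-Z"\'])')


def first_sentence(text: str) -> str:
    """Return the first sentence of `text`, or "" if the text is empty."""
    # Collapse newlines and repeated spaces left over from HTML extraction.
    cleaned = " ".join(text.split())
    if not cleaned:
        return ""
    # Split at most once and keep everything before the first boundary.
    return _SENTENCE_END.split(cleaned, maxsplit=1)[0]


if __name__ == "__main__":
    sample = "ParseRSS fetches feeds.  It returns full text."
    print(first_sentence(sample))
```

A real pipeline would probably replace the regular expression with a trained sentence tokenizer, but the shape of the step (clean the text, find the first boundary, keep the prefix) stays the same.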

I am thinking I should clean up the code, add a few more extractors and release it soon as a URL analysis library (I was thinking "demands" would be a good name to pair with Python's "requests"). I would like to get entity disambiguation from Wikipedia into it first though, as I think that is a vital feature. My funding pitch largely failed, so I will approach that somewhat more slowly, but the methodology and libraries for constructing reasonable entity disambiguation from topic modelling (rather than heaviest sub-graph approaches) are out there.

I recently saw an API on HN selling basically this type of extraction from URLs, but I think it's necessary (along with Common Crawl and other such things) for this base layer to be there for free so people can properly compete with Google. I think Google currently runs 200+ extractors and classifiers on every page, so they have a huge advantage over startups (and non-profits, which is my area of interest) in this area, one that Common Crawl can't help with by just providing the raw data.

sdoering 4832 days ago

As I am trying to learn some basics of automated text processing and categorization, I am always fond of experiments and ideas like yours.

The idea of releasing it as "demands" sounds great. I would love to hear from you when it is released.

aoroz 4833 days ago

I like your prototype. Works well and looks clean. I have also been trying to work on something like this.

dreadsword 4833 days ago

You should have outbid Yahoo for Summly to level up your first sentence summaries!

sdoering 4832 days ago

It made me smile when I tried a German RSS feed. The sentiment analysis was always negative, as the German word "die" (the feminine/plural definite article) was confused with the concept of dying.
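The effect described above is what one would expect if the sentiment step is a word-list scorer built for English and applied without checking the language. How ParseRSS actually scores sentiment is not stated in this thread, so the sketch below is only an assumption: the lexicon, its weights, and the function names `lexicon_score` and `guarded_score` are all hypothetical. The language code is assumed to come from somewhere upstream, for example the feed's optional `<language>` element.

```python
import re

# Toy English lexicon. The words and weights are illustrative assumptions.
ENGLISH_LEXICON = {"die": -1.0, "dead": -1.0, "great": 1.0, "good": 1.0}


def lexicon_score(text, lexicon=ENGLISH_LEXICON):
    """Sum the lexicon weights of all alphabetic tokens in `text`."""
    # [^\W\d_]+ matches runs of letters, including umlauts and other
    # non-ASCII letters.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return sum(lexicon.get(token, 0.0) for token in tokens)


def guarded_score(text, language):
    """Score only English text; return None for any other language."""
    if language != "en":
        return None  # no lexicon for this language, so do not guess
    return lexicon_score(text)


if __name__ == "__main__":
    german = "Die Regierung plant die Reform"
    # Unguarded: both occurrences of the article "die" hit the English lexicon.
    print(lexicon_score(german))
    # Guarded: the German text is skipped instead of being scored as negative.
    print(guarded_score(german, "de"))
```

Returning `None` rather than a neutral `0.0` keeps "not scored" distinguishable from "scored as neutral" for whatever consumes the output.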

Nonetheless, this really is a great service. I will try to incorporate it into an experiment I am running. I have been using the Readability API until now, but on German news sites it is not that good at extracting the pure text content.

It nearly always includes navigation or advertisement text, which makes it difficult to do the text analysis I am trying to do.

eli 4833 days ago

Are you using something like diffbot or are you doing the scraping yourself?

Edit: Ah, I see, it's Streamified.me
