HN Mirror

	by nreece 5912 days ago
	The 'cleaner' copy of the same article: http://beta.thehindu.com/arts/magazine/article435036.ece

1 comments

ronnier 5912 days ago

One small piece of the weekend project I'm working on is extracting text from articles. You can try it on this one:

http://toadjaw.com/article?url=http%3A%2F%2Fwww.hindu.com%2F...

SMrF 5912 days ago

Nice. Did you just recreate the Readability algorithm or are you trying your own approach?

ronnier 5912 days ago

In large part I ported over Readability, although it's not exactly the same since I'm doing some additional processing. I started it on Friday and finished it up yesterday, so it's still pretty rough, but it's working pretty well.

You can try more here: http://toadjaw.com/article

zeynel1 5912 days ago

This is great and very useful. Are there plans to add Unicode support? http://toadjaw.com/article?url=http://www.tdkterim.gov.tr/bt...

ronnier 5912 days ago

That's horrible, I'll get that fixed. Shouldn't be a problem.

Edit: It's fixed now.

Edit: And maybe not, since my change broke other pages. I'll have to think about this.

nreece 5912 days ago

Nice. Readability does that too - http://lab.arc90.com/experiments/readability/

ronnier 5912 days ago

Yup, that was the basis of my code. With some modifications, I basically ported the JavaScript to C#.
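
For anyone curious what this kind of article text extraction can look like, here is a rough Python sketch. It is not the code discussed in this thread and not the Readability source. It only illustrates one simple heuristic of the same general flavor: keep paragraphs that contain a fair amount of text and are not mostly links. The names `ParagraphCollector` and `extract_text`, and the thresholds `min_chars` and `max_link_density`, are made up for this example.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects (text, link_density) for every <p> element in a page."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []    # finished paragraphs as (text, link_density)
        self._in_p = False
        self._link_depth = 0    # > 0 while inside an <a> element
        self._chunks = []       # raw text pieces of the current paragraph
        self._link_chars = 0    # characters of the paragraph that sit inside links

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._finish()      # an unclosed <p> ends when the next one starts
            self._in_p = True
        elif tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag == "p":
            self._finish()
        elif tag == "a" and self._link_depth > 0:
            self._link_depth -= 1

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)
            if self._link_depth:
                self._link_chars += len(data)

    def _finish(self):
        if self._in_p:
            raw = "".join(self._chunks)
            text = " ".join(raw.split())    # collapse runs of whitespace
            if text:
                density = self._link_chars / max(len(raw), 1)
                self.paragraphs.append((text, density))
        self._in_p = False
        self._chunks = []
        self._link_chars = 0

    def close(self):
        super().close()
        self._finish()          # flush a paragraph left open at end of input


def extract_text(html, min_chars=80, max_link_density=0.5):
    """Return paragraphs that look like article body text (assumed thresholds)."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    kept = [
        text
        for text, density in collector.paragraphs
        if len(text) >= min_chars and density <= max_link_density
    ]
    return "\n\n".join(kept)
```

Calling `extract_text(page_html)` on a fetched page returns the surviving paragraphs joined by blank lines. A real extractor would do considerably more (scoring parent containers, handling pages that do not use `<p>` tags, detecting the character encoding), so treat the thresholds as starting points to tune.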