Hacker News
by nreece 5865 days ago
The 'cleaner' copy of the same article: http://beta.thehindu.com/arts/magazine/article435036.ece

A small piece of my weekend project that I'm working on is extracting text from articles. You can try it on this one:

http://toadjaw.com/article?url=http%3A%2F%2Fwww.hindu.com%2F...

Nice. Did you just recreate the Readability algorithm or are you trying your own approach?
In large part I ported over Readability, although it's not exactly the same since I'm doing some additional processing. I started it on Friday and finished it up yesterday, so it's still pretty rough, but it's working pretty well.

You can try more here: http://toadjaw.com/article

This is great and very useful. Are there plans to add Unicode support? http://toadjaw.com/article?url=http://www.tdkterim.gov.tr/bt...
That's horrible, I'll get that fixed. Shouldn't be a problem.

Edit: It's fixed now.

Edit: And maybe not, since my change broke other pages. I'll have to think about this.

Nice. Readability does that too - http://lab.arc90.com/experiments/readability/
Yup, that was the basis of my code. With some modifications, I basically ported the JavaScript to C#.