by marcamillion 5863 days ago
Hmm... it kind of works. The issue I'm having now is that on the page I tried, it only captures part of the page.

http://52weeksofux.com/tagged/week_1

I tried to capture that page, and it only captured the bottom story - not the top.

I never checked the source or anything, so it could very well be something about that specific site.

That was just the first site I tried, and that's what I found.

Hope that helps :)

1 comment

Ah, that is the result of the Readability algorithm (http://lab.arc90.com/experiments/readability/). It is a bit greedy and chopped off some of the content.

If there are better algorithms/tools for scraping content, please let me know.
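
To illustrate what "greedy" can mean here, below is a rough sketch of a pick-one-block extractor next to a more tolerant variant. This is not the actual Readability code: the scoring rule, the function names, the div ids, and the sample page are all made up for illustration, and it assumes each story sits in its own div with an id.

```python
from html.parser import HTMLParser


class BlockTextCollector(HTMLParser):
    """Collect the text inside each <div id="..."> (hypothetical page layout)."""

    def __init__(self):
        super().__init__()
        self.blocks = {}   # div id -> accumulated text, in page order
        self._stack = []   # ids of the currently open divs (None if no id)

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self._stack.append(dict(attrs).get("id"))

    def handle_endtag(self, tag):
        if tag == "div" and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Credit text to the innermost open div that has an id.
        for div_id in reversed(self._stack):
            if div_id is not None:
                self.blocks[div_id] = self.blocks.get(div_id, "") + data
                break


def score(text):
    # Toy score (an assumption): longer text with more commas looks like prose.
    return len(text.strip()) // 100 + text.count(",")


def collect_scores(html):
    parser = BlockTextCollector()
    parser.feed(html)
    parser.close()
    return {div_id: score(text) for div_id, text in parser.blocks.items()}


def extract_greedy(html):
    """Keep only the single best-scoring block; everything else is dropped."""
    scores = collect_scores(html)
    return [max(scores, key=scores.get)]


def extract_tolerant(html, ratio=0.3):
    """Keep every block scoring at least `ratio` of the best score."""
    scores = collect_scores(html)
    top = max(scores.values())
    return [div_id for div_id, s in scores.items() if s >= ratio * top]


# Made-up page: a short top story, a longer bottom story, and a footer.
PAGE = "".join([
    '<div id="story-top"><p>', "Short intro, with a point. " * 8, "</p></div>",
    '<div id="story-bottom"><p>',
    "Longer story, with more detail, and commas. " * 12,
    "</p></div>",
    '<div id="footer"><p>Contact</p></div>',
])

print(extract_greedy(PAGE))    # ['story-bottom']
print(extract_tolerant(PAGE))  # ['story-top', 'story-bottom']
```

The greedy version keeps only the longest story and loses the top one, which looks a lot like the behavior described above. The tolerant version keeps both stories and still drops the footer, but the 0.3 cutoff is arbitrary and would need tuning on real pages.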