Hacker News
by fenwick67 2341 days ago
One thing that might hook me is if you said which sites are scraped.

I know a lot of people are asking for this, but listing every URL would literally fill up this whole page.

The best way I can describe it is "publicly allowable URLs on the web," which includes blogs, forums, social networks, websites, and more. If we can pick up the content as HTML, RSS, Atom, JSON, or plain text, we try to fetch it.

We don't scrape any site that disallows it in its robots.txt, and we don't scrape material that is only available behind logins or paywalls.
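
For anyone curious what a robots.txt check like that can look like, here is a minimal sketch using Python's standard `urllib.robotparser`. This is only an illustration, not the actual crawler code: the `is_allowed` helper, the `examplebot` user agent, the example.com URLs, and the sample rules are all made up for the example.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url.

    Hypothetical helper for illustration; a real crawler would first
    download robots.txt from the target site and cache the result.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules: everything is allowed except paths under /private/.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for url in (
        "https://example.com/blog/post",
        "https://example.com/private/page",
    ):
        verdict = "fetch" if is_allowed(EXAMPLE_ROBOTS, "examplebot", url) else "skip"
        print(url, "->", verdict)
```

Note that robots.txt only covers crawl permissions. Skipping content behind logins and paywalls is a separate check that this sketch does not attempt.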