

by LisaG 3211 days ago

As long as you obey robots.txt, there is nothing wrong with crawling. Your code on GitHub doesn't indicate which sites you collect data from, so nothing in it suggests you are scraping rather than crawling in an acceptable manner. Though it wouldn't hurt to label your work as crawler scripts instead of scraping scripts ;)
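
For what it's worth, here is a minimal sketch of what "obeying robots.txt" can look like in Python, using the standard library's urllib.robotparser. The user agent name "example-crawler", the example.com URLs, and the helper is_allowed are made up for illustration; a real crawler would fetch each site's actual robots.txt rather than use an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines from a robots.txt file.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/index.html",
        "https://example.com/private/data.html",
    ):
        verdict = "fetch" if is_allowed(EXAMPLE_ROBOTS_TXT, "example-crawler", url) else "skip"
        print(verdict, url)
```

Against a live site you would probably call set_url() and read() on the parser instead of parse(), and cache the result per host so you don't re-download robots.txt for every page.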

Why use your own scripts and not Nutch?

Do you know about Common Crawl? https://aws.amazon.com/public-datasets/common-crawl/ It obeys robots.txt, so it may not have everything you want, but it could save you part of the effort of crawling yourself.