by supersan 3570 days ago
Hi, I find the blog more interesting right now, since I hope to find write-ups about how you managed such a herculean task on your own.

Crawling 2bn pages could take forever and could generate huge bandwidth bills, so any lessons you learnt, pitfalls you faced, etc. would be a great read.

1 comments

Some issues that appeared over the years:

Block outgoing connections to private IP ranges in your firewall. Otherwise your hosting provider might think you are trying to hack them. Apparently there are a lot of links out there that point to hosts which resolve to private IP ranges.
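As a complement to the firewall rule (not a replacement for it), the crawler itself can refuse to fetch from hosts that resolve to non-public addresses. Below is a minimal Python sketch using only the standard library. The function names `is_blocked_address` and `crawlable_addresses` are made up for illustration, and the choice of which address classes to block is an assumption.

```python
import ipaddress
import socket


def is_blocked_address(ip_text):
    """Return True if the address is one a crawler should never connect to."""
    # Drop an IPv6 zone suffix such as "%eth0" before parsing.
    ip = ipaddress.ip_address(ip_text.split("%", 1)[0])
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def crawlable_addresses(hostname, port=80, resolve=socket.getaddrinfo):
    """Resolve hostname and keep only the addresses that are safe to fetch from.

    `resolve` defaults to socket.getaddrinfo; it is a parameter only so that
    a fake resolver can be swapped in for testing.
    """
    infos = resolve(hostname, port, type=socket.SOCK_STREAM)
    # Each getaddrinfo entry is (family, type, proto, canonname, sockaddr),
    # and sockaddr[0] is the IP address as text.
    addresses = {info[4][0] for info in infos}
    return sorted(a for a in addresses if not is_blocked_address(a))
```

If `crawlable_addresses` returns an empty list for a host, the crawler would skip that URL instead of connecting.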

Another problem with following links is that you are bound to run across some that are malware command & control servers. My ISP received several complaints after authorities took over one and used the C&C server's domain as a honeypot. My crawler is on a whitelist now.

I had one person who vehemently complained that I was trying to hack him, because the software downloaded his robots.txt. I'm NOT kidding! :)

Make sure your robots.txt parsing is working correctly. At one point I had an undiscovered bug in the software which basically caused it to think everything was allowed. Luckily someone was nice enough to let me know. And he was really nice about it, even though he would have had every right to be angry.
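One way to catch an "everything is allowed" bug early is a small regression check that feeds the parser a known robots.txt and asserts that the disallowed paths really are refused. The sketch below uses Python's standard `urllib.robotparser` purely as a stand-in, since the crawler discussed here has its own parser. The bot names, the sample rules and the helper `build_parser` are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Sample rules: one bot is banned entirely, everyone else is kept
# out of /private/ only.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def check_parser_is_not_too_permissive():
    """Raise AssertionError if disallowed URLs are reported as fetchable."""
    parser = build_parser(SAMPLE_ROBOTS_TXT)
    # The banned bot must not be allowed anywhere.
    assert not parser.can_fetch("ExampleBot", "https://example.com/index.html")
    # Other bots are blocked from /private/ but allowed elsewhere.
    assert not parser.can_fetch("OtherBot", "https://example.com/private/a.html")
    assert parser.can_fetch("OtherBot", "https://example.com/index.html")


check_parser_is_not_too_permissive()
```

Running a check like this at crawler startup, against whatever parser is actually in use, means a regression fails loudly before any site gets hit.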

A major bottleneck is DNS queries. Run your own DNS server and even cache the hostname/IP pairs yourself. Do not even think about using your ISP's DNS server. If you bombard them with 100+ DNS requests/s then they WILL be angry. :)
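A rough idea of what caching the hostname/IP pairs yourself could look like, as a minimal in-memory Python sketch. The class name `DnsCache`, the fixed one-hour lifetime and the injectable `resolve`/`clock` parameters are all assumptions made for the example. `socket.gethostbyname` does not expose the record's real TTL, so a fixed expiry is used instead.

```python
import socket
import time


class DnsCache:
    """Cache hostname -> IPv4 address lookups for a fixed number of seconds."""

    def __init__(self, ttl_seconds=3600, resolve=socket.gethostbyname,
                 clock=time.monotonic):
        self._ttl = ttl_seconds
        self._resolve = resolve      # swapped out in tests
        self._clock = clock          # swapped out in tests
        self._entries = {}           # hostname -> (ip, expires_at)

    def lookup(self, hostname):
        # Normalise so "Example.com" and "example.com." share one entry.
        key = hostname.lower().rstrip(".")
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and cached[1] > now:
            return cached[0]
        # Cache miss or expired entry: ask the resolver and remember the answer.
        ip = self._resolve(key)
        self._entries[key] = (ip, now + self._ttl)
        return ip
```

A crawler would call `cache.lookup(host)` before every fetch, so repeated links to the same host cost one DNS query per expiry period rather than one per URL. At a scale of billions of pages the cache would probably need to be persistent and size-bounded, which this sketch does not attempt.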

> Run your own DNS server and even cache the hostname/IP pairs yourself.

This[1] might be a useful resource to get started:

[1] https://scans.io/

(Register and download the IPv4 Address Space data file to use as an initial cache and then append/update as you go.)

Bookmarked. Thanks!