by marginalia_nu 1486 days ago
Cool.

But a warning, based on doing quite a lot of crawling from home for my own search engine: it's very easy to have your IP or IP block end up on annoying graylists, where basically every other website you visit will throw a CAPTCHA in your face. I'm aware this is a risk and use a VPN for most of my private web surfing anyway, so it's not that much of a bother for me, but it's a bit sketchy to expose other people to that risk through something like this.

It would probably be wise to use canned crawls for major websites, maybe something like trading WARCs <https://en.wikipedia.org/wiki/Web_ARChive> over BitTorrent or whatever. Most of these types of websites don't change that often in the places that matter.
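
For anyone who hasn't poked at a WARC before, here is a rough sketch of what consuming a canned crawl could look like. It assumes an uncompressed WARC stream with simple one-line headers; the function name `iter_warc_records` is made up for illustration, and a real consumer would more likely lean on an existing WARC library.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) pairs from an uncompressed WARC byte stream.

    Illustrative only: assumes one-line headers (no continuation lines)
    and a Content-Length header on every record.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separate consecutive records
        version = line.strip().decode("ascii")
        if not version.startswith("WARC/"):
            raise ValueError(f"not a WARC record: {version!r}")

        # Read named header fields until the blank line that ends them.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()

        # The payload is exactly Content-Length bytes long.
        payload = stream.read(int(headers["content-length"]))
        yield headers, payload


# Tiny hand-built record, just to show the shape of the data.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello\r\n\r\n"
)
records = list(iter_warc_records(io.BytesIO(sample * 2)))
```

WARCs in the wild are usually gzip-compressed (`.warc.gz`), so the stream would typically come from `gzip.open(path, "rb")` rather than a plain file.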

1 comment

Thanks for the feedback! I'll keep that in mind as this is built out. Fortunately, the initial bootstrapping uses data from the Internet Archive, and the crawls afterwards are only to check for updates (at a reasonable rate). The number of URLs being hit is much, much lower in the end than you would think.
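
To make "a reasonable rate" a bit more concrete, here is a minimal sketch of a per-host throttle. It is not the project's actual code: the class name `HostThrottle`, the 10-second default, and the injectable `clock`/`sleep` parameters are all assumptions for illustration.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum delay between requests to the same host.

    Hypothetical helper: the name and defaults are illustrative only.
    """

    def __init__(self, min_delay=10.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # host -> earliest time of the next request

    def wait(self, url):
        """Block until `url`'s host may be hit again; return seconds waited."""
        host = urlsplit(url).hostname or ""
        now = self._clock()
        ready_at = self._next_allowed.get(host, now)
        delay = max(0.0, ready_at - now)
        if delay > 0:
            self._sleep(delay)
        # Schedule the next slot relative to when this request actually goes out.
        self._next_allowed[host] = max(now, ready_at) + self.min_delay
        return delay


# Demonstration with a frozen clock and a recording sleep, so nothing blocks.
slept = []
throttle = HostThrottle(min_delay=5, clock=lambda: 100.0, sleep=slept.append)
first = throttle.wait("http://a.example/page1")   # first hit: no wait
second = throttle.wait("http://a.example/page2")  # same host: waits min_delay
other = throttle.wait("http://b.example/")        # different host: no wait
```

Calling `wait(url)` before each fetch spaces out requests per host while leaving unrelated hosts unaffected, which is the behavior that matters for staying off graylists.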