by marginalia_nu
1486 days ago
Cool. But a warning, based on doing quite a lot of crawling from home through my own search engine: it's very easy to have your IP or IP block end up on annoying graylists, where basically every other website you visit will throw a CAPTCHA in your face. I'm aware of this risk and use a VPN for most of my private web surfing anyway, so it's not that much of a bother for me, but it's a bit sketchy to expose other people to that risk through something like this. It would probably be wise to use canned crawls for major websites, maybe something like trading WARCs <https://en.wikipedia.org/wiki/Web_ARChive> over BitTorrent or whatever. Most of these types of websites don't change that often in the places that matter.
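To make the canned-crawl idea a bit more concrete, here's a rough sketch of what I mean, not how any particular crawler does it. Everything in it is made up for illustration: the `CANNED_HOSTS` list, the `resolve` function, and the assumption that the canned store is just a dict of URL to body that you've already unpacked from a WARC somewhere. The point is only that "major" hosts never get hit from the user's home IP.

```python
from urllib.parse import urlsplit

# Hypothetical list of "major" sites that are never crawled live from a home IP.
CANNED_HOSTS = {"en.wikipedia.org", "stackoverflow.com"}


def resolve(url, canned_store, live_fetch):
    """Return (source, body) for a URL.

    canned_store: assumed to be a mapping of URL -> bytes, pre-extracted
                  from a shared archive (e.g. a WARC someone else crawled).
    live_fetch:   any callable that takes a URL and returns bytes.
    """
    host = urlsplit(url).hostname or ""
    if host in CANNED_HOSTS:
        # Serve only from the archive. A miss yields None instead of
        # falling back to a live request, so the user's IP stays clean.
        return ("canned", canned_store.get(url))
    return ("live", live_fetch(url))
```

A miss on a canned host deliberately returns `None` rather than falling through to the network; a slightly stale or missing page seems like a better trade than getting someone's home connection graylisted.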