by
Nextgrid
153 days ago
Running the bot nowadays is hard because a lot of sites will now block you, not just by asking nicely via robots.txt but by checking your actual source IP. Once they see the request isn't coming from Google, they send you a 403.
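To make the distinction concrete, here is a rough sketch of the two layers from the crawler's side. The robots.txt contents, the `MyCrawler` user agent, and both helper names are made up for illustration. The first check is the "asking nicely" part, which you can evaluate before sending anything; the IP-based block you only discover when the 403 comes back.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that admits Googlebot and turns everyone else away.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def allowed_by_robots(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def is_hard_block(status_code: int) -> bool:
    """Treat a 403 as a server-side block (for example, one based on source IP).

    Assumption for this sketch: robots.txt cannot reveal this kind of block;
    it only shows up in the response to an actual request.
    """
    return status_code == 403
```

With this made-up file, `allowed_by_robots("Googlebot", "https://example.com/page")` is true and the same call with `"MyCrawler"` is false. Claiming to be Googlebot in the User-Agent header gets you past the first check but does nothing about the second, since the site is looking at where the request actually comes from.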
eloisius
153 days ago
Cloudflare’s ubiquity makes bootstrapping a search index via crawler virtually impossible, but what about data sources like Common Crawl?