by danpalmer 526 days ago
I've done crawling at a small startup and I've done crawling at a big tech company. This is not crawling more politely than big tech.

There are a few things that stand out, like:

> I fetch all robots.txts for given URLs in parallel inside the queue's enqueue function.

Could this end up DoS'ing a server, or at least being "impolite", through the robots.txt requests alone?
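One way to keep that burst down is to collapse a batch of URLs to one robots.txt fetch per host before anything goes out. This is only a sketch of the idea, not what the article does; `robots_urls_for` is a hypothetical helper name, and it assumes plain http/https URLs.

```python
from urllib.parse import urlsplit


def robots_urls_for(urls):
    """Return one robots.txt URL per distinct scheme+host, in first-seen order.

    Hypothetical helper: enqueueing 1,000 URLs from the same host should
    trigger a single robots.txt request, not 1,000.
    """
    seen = {}  # dict keeps insertion order, so output order is deterministic
    for url in urls:
        parts = urlsplit(url)
        if parts.scheme in ("http", "https") and parts.netloc:
            key = f"{parts.scheme}://{parts.netloc.lower()}/robots.txt"
            seen.setdefault(key, None)
    return list(seen)
```

Even deduplicated, these fetches would still need to go through the same per-domain throttling as every other request.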

All of this logic is per-domain, but nothing is mentioned about what constitutes a domain. If the definition is naive, the crawler could easily end up overloading a server that uses wildcard subdomains to serve its content, like Substack, which puts each blog on a separate subdomain.
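To make the concern concrete: if the rate-limit key is the full hostname, every Substack blog gets its own budget even though they may share servers. A rough sketch of keying on a parent domain instead is below. `rate_limit_key` is a hypothetical name, and taking the last two labels is a deliberately crude assumption; it is wrong for suffixes like `co.uk`, and a real crawler would consult the Public Suffix List.

```python
from urllib.parse import urlsplit


def rate_limit_key(url: str) -> str:
    """Map a URL to the key used for per-domain throttling.

    ASSUMPTION: the last two host labels approximate the registrable
    domain. This is a heuristic for illustration only; it breaks on
    multi-part suffixes such as "co.uk".
    """
    host = urlsplit(url).hostname or ""  # hostname is already lowercased
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host
```

With this, `https://alice.substack.com/p/x` and `https://bob.substack.com/` both map to `substack.com` and so share one budget.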

When I was at a small startup doing crawling, the main thing our partners wanted from us was a maximum hit rate, which varied by partner. We typically promised fewer than 1 request per second, which would never cause perceptible load and was usually sufficient for our use case.
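A promise like that can be enforced with a minimum gap between requests to the same partner. The following is a minimal sketch, not the system we actually ran; the class name, the 2-second default interval, and the injectable clock are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict


class PerKeyRateLimiter:
    """Enforce a minimum interval between requests that share a key.

    ASSUMPTION: min_interval=2.0 (0.5 requests/second) stands in for a
    "fewer than 1 request per second" promise; real limits varied.
    """

    def __init__(self, min_interval: float = 2.0,
                 clock: Callable[[], float] = time.monotonic):
        self.min_interval = min_interval
        self.clock = clock
        self.next_allowed: Dict[str, float] = {}

    def reserve(self, key: str) -> float:
        """Book the next slot for key; return seconds to wait before sending."""
        now = self.clock()
        start = max(now, self.next_allowed.get(key, now))
        self.next_allowed[key] = start + self.min_interval
        return start - now


# Usage: sleep for the returned delay, then make the request.
limiter = PerKeyRateLimiter(min_interval=2.0)
delay = limiter.reserve("partner.example")
time.sleep(delay)  # the first request for a key has no delay
```

Reserving slots up front means several workers can queue requests for one partner without ever exceeding the agreed rate, while other partners are unaffected.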

Here at $BigTech, the systems for ensuring "polite" and policy-compliant crawling (robots.txt, etc.) are more extensive than I could possibly have imagined before coming here.

It doesn't surprise me that OpenAI and Amazon don't have great systems for this, since both are new to the crawling world. But concluding that "Big Tech" doesn't do polite crawling is a bit of a stretch, given that search engines are most likely doing the best crawling available.

1 comment

It’s probably a huge liability to not have very advanced and compliant crawlers.

Accidentally DDoSing several businesses seems like a recipe for an expensive lawsuit.