


	by Ian_Kerins 99 days ago
	A lot of the discussion around the /crawl endpoint seems to miss a key detail in the docs. The crawler explicitly identifies itself as a bot, respects robots.txt, and does not bypass CAPTCHAs, WAF rules, or Cloudflare Bot Management. So technically it’s a nice managed crawling system, but in practice it only works on sites that already allow bots to crawl them. For many real-world data extraction use cases, the problem isn’t crawling infrastructure, it’s dealing with sites that actively block bots. In those cases you still need traditional scraping approaches.
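	To illustrate what "respects robots.txt" means in practice, here is a minimal sketch of the check a well-behaved crawler might run before fetching a URL. It uses Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent, and the `allowed_to_crawl` helper are all hypothetical, and this is not the /crawl endpoint's actual implementation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_to_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" is an assumed bot name used only for illustration.
    for url in ("https://example.com/public/page",
                "https://example.com/private/data"):
        print(url, allowed_to_crawl(ROBOTS_TXT, "ExampleBot", url))
```

	A crawler that honors this check simply skips any disallowed URL, which is why it gets nowhere on sites that disallow bots or block them at the WAF.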