by rockwotj 617 days ago
I worked for a short time on SearchGPT, and I can tell you OpenAI does respect robots.txt, at least it did when I was there, and it does now. They are also careful to shard per domain and only crawl each domain at a small rate (~1 qps) so as not to DDoS the site. OpenAI also uses User-Agent strings to identify itself: https://platform.openai.com/docs/bots

They have dedicated user agents for search crawling, for when a user directly asks about a site, and for training data.
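
For anyone wondering what "respect robots.txt and stay around 1 qps per domain" can look like in practice, here is a rough sketch. It is not OpenAI's code, just an illustration under my own assumptions: the `PoliteScheduler` class, the `ExampleBot` user agent, and the 1-second interval are all hypothetical, and the robots.txt text is passed in directly instead of being fetched over the network.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class PoliteScheduler:
    """Hypothetical helper: gates fetches by robots.txt rules and a per-domain rate."""

    def __init__(self, user_agent, min_interval=1.0, clock=time.monotonic):
        self.user_agent = user_agent
        self.min_interval = min_interval  # seconds between requests to one domain (~1 qps)
        self.clock = clock                # injectable so the logic can be tested without sleeping
        self.robots = {}                  # domain -> parsed robots.txt
        self.next_ok = {}                 # domain -> earliest start time of the next request

    def load_robots(self, domain, robots_txt):
        """Parse robots.txt content that the caller has already downloaded."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.robots[domain] = parser

    def allowed(self, url):
        """Return True only if robots.txt for the URL's domain permits this user agent."""
        parser = self.robots.get(urlsplit(url).netloc)
        # Assumption: refuse to crawl a domain until its robots.txt has been loaded.
        return parser is not None and parser.can_fetch(self.user_agent, url)

    def wait_time(self, url):
        """Reserve the next slot for the URL's domain and return the seconds to wait."""
        domain = urlsplit(url).netloc
        now = self.clock()
        start = max(now, self.next_ok.get(domain, now))
        self.next_ok[domain] = start + self.min_interval
        return start - now
```

A crawler loop would call `allowed(url)` first, then sleep for `wait_time(url)` before fetching. Because the limiter state is keyed by domain, requests to different domains never delay each other, which is roughly the effect of sharding per domain.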

1 comment

> They are also careful to shard per domain and only crawl each domain at a small rate (~1 qps) so as not to DDoS the site.

Maybe that's their intent, but this was only a month ago: https://www.gamedeveloper.com/business/-this-was-essentially...

> "The homepage was being reloaded 200 times a second, as the [OpenAI] bot was apparently struggling to find its way around the site and getting stuck in a continuous loop," added Coates. "This was essentially a two-week long DDoS attack in the form of a data heist."

Maybe someone broke the rule against deploying on a Friday. Ouch.