by koolba 616 days ago
> The Bytespider bot, much like those of OpenAI and Anthropic, does not respect robots.txt, the research shows. Robots.txt is a line of code that publishers can put into a website that, while not legally binding in any way, is supposed to signal to scraper bots that they cannot take that website’s data.

Do any of these scrapers uniquely and unambiguously identify themselves as bots?

Or are those days long over?

5 comments

Some of the scrapers used by big companies do identify themselves as bots by using unique user agents. Of course, that does not mean they don't also have other bots running around without a bot user agent name.

Whether those days are over or not will depend greatly on the outcome of the ongoing New York Times vs OpenAI lawsuit. If OpenAI wins, that pretty much gives all the other scrapers the green light to feast upon the web.

I worked for a short time on SearchGPT, and I can tell you OpenAI does respect robots.txt; it did when I was there, and it does now. They are also careful to shard per domain and only crawl each domain at a small rate (~1 qps) so as not to DDoS the site. OpenAI also uses User-Agent strings to identify itself: https://platform.openai.com/docs/bots

They have dedicated user agents for search crawling, for fetches made when a user directly asks about a site, and for collecting training data.
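
For anyone curious what that kind of politeness looks like in practice, here is a minimal Python sketch, not OpenAI's actual crawler. It checks robots.txt with the standard library's `urllib.robotparser` and spaces out requests per domain. The bot name `ExampleBot`, the robots.txt contents, and the `PerDomainLimiter` class are all made up for illustration; the 1-second interval just mirrors the ~1 qps figure above.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; a real crawler would fetch this from the site.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow: /
"""

USER_AGENT = "ExampleBot"  # hypothetical crawler name, not a real bot


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a robots.txt parser from text, without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class PerDomainLimiter:
    """Allow at most one request per `min_interval` seconds for each domain."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._next_allowed = {}  # domain -> earliest time of the next request

    def wait_time(self, url: str, now: float) -> float:
        """Return how long to wait before fetching `url`, and reserve that slot."""
        domain = urlparse(url).netloc
        start = max(self._next_allowed.get(domain, now), now)
        self._next_allowed[domain] = start + self.min_interval
        return start - now


parser = make_parser(ROBOTS_TXT)
limiter = PerDomainLimiter(min_interval=1.0)  # roughly 1 request/second/domain
```

A crawl loop would call `parser.can_fetch(USER_AGENT, url)` before each request, skip the URL if it returns `False`, and otherwise sleep for `limiter.wait_time(url, time.monotonic())` seconds before fetching.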

> They are also careful to shard per domain and only crawl each domain at a small rate (~1 qps) so as not to DDoS the site.

Maybe that's their intent, but this was only a month ago: https://www.gamedeveloper.com/business/-this-was-essentially...

> "The homepage was being reloaded 200 times a second, as the [OpenAI] bot was apparently struggling to find its way around the site and getting stuck in a continuous loop," added Coates. "This was essentially a two-week long DDoS attack in the form of a data heist."

Maybe someone went against the rule about not deploying on a Friday, ouch.

This one does, and I blocked them categorically from all my domains.

Most of the good ones will tag themselves in the user agent and follow robots.txt.

The ones that don't are the ones people are trying to block the most. Sometimes Google or Bing go crazy and start scraping the same resource over and over again, but most scraping tools causing load peaks are the badly written/badly configured/malicious ones.
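
As a rough illustration of what "tagging themselves in the user agent" buys you on the server side, here is a small Python sketch that matches a request's User-Agent header against a list of crawler tokens. The token list is illustrative and far from exhaustive, and `identify_bot` is a hypothetical helper, not part of any library. By construction it only catches bots that announce themselves, which is exactly the limitation described above.

```python
from typing import Optional

# Illustrative, non-exhaustive list of tokens that self-identifying
# crawlers put in their User-Agent header.
KNOWN_BOT_TOKENS = (
    "GPTBot",
    "OAI-SearchBot",
    "ChatGPT-User",
    "Bytespider",
    "Googlebot",
    "bingbot",
)


def identify_bot(user_agent: str) -> Optional[str]:
    """Return the first known crawler token found in the header, else None."""
    ua = user_agent.lower()
    for token in KNOWN_BOT_TOKENS:
        if token.lower() in ua:
            return token
    return None
```

A server could use the result to block, throttle, or just log the request. A scraper that sends a browser-like User-Agent passes straight through, so this is a courtesy filter, not a defense.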

I'm thinking a lot of those issues might be related to “smart” scraping that parses JavaScript. We could lean into it and just make it easier for the bots to scrape by removing JavaScript from websites.

I realize this is somewhat off-topic, but the big companies kind of destroyed the internet with all the JavaScript frameworks and whatnot.

> Do any of these scrapers uniquely and unambiguously identify themselves as bots?

It seems like all of them do, yeah: https://github.com/eob/isai/blob/b9060db7dc1a7789b322b8c2838...

Not sure they're really "scrapers" though if they're initiated by a user for a single webpage/website. In that case they're more like "user-agents", unless they automatically fan out from there to get more content.