| HN Mirror


by nijave 50 days ago

I think there are a few things at play here:

- AI scrapers will pull a bunch of docs from many sites in parallel (so instead of a human request where someone picks a single Google result, it hits a bunch of sites)

- AI will crawl the site looking for the correct answer which may hit a handful of pages

- AI sends requests in quick succession (big bursts instead of small trickle over longer time)

- Personal assistants may crawl the site repeatedly scraping everything (we saw a fair bit of this at work, they announced themselves with user agents)

- At work (b2b SaaS webapp) we also found that the personal assistant variety tended to hammer really computationally expensive data export and reporting endpoints generally without filters. While our app technically supported it, it was very inorganic traffic

That said, I don't think the solution is blanket blocks. Really, it's exposing that sites are poorly optimized for emerging technology.
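As one illustration of something narrower than a blanket block, bursty clients hitting expensive endpoints could be throttled with a per-client token bucket. This is only a minimal sketch: the `TokenBucket` class, its `rate`, `capacity`, and `cost` parameters, and the idea of charging export or reporting requests a higher cost are assumptions for illustration, not anything the app described above is known to use.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate, capacity, now=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.now = now              # injectable clock, handy for testing
        self.tokens = float(capacity)
        self.updated = now()

    def allow(self, cost=1.0):
        """Return True and spend `cost` tokens if the request may proceed."""
        current = self.now()
        elapsed = current - self.updated
        self.updated = current
        # Refill for the time that has passed, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A server would keep one bucket per client (for example per API key or user agent) and could pass a larger `cost` for unfiltered export or reporting requests, so a steady trickle passes while a burst is rejected or delayed.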


Sesse__ 49 days ago

Also, relevant for forges: AI doesn't understand what it's clicking on. Git forges tend to e.g. have a lot of links like “download a tarball at this revision” which are super-expensive as far as resources go, and AI crawlers will click on those because they click on every link that looks shiny. (And there are a lot of revisions in a project like VLC!) Much, much more often than humans do.


account42 47 days ago

This is also irrelevant to the original comment, which is complaining about bot checks for looking at the root of the repository - which is probably the most requested resource and should be 100% served from cache, at a cost much lower than running the bot checks.

It's simply bad, inefficient software and we shouldn't keep making excuses for it.
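To make the caching point concrete, here is a minimal sketch of a time-based cache in front of an expensive page render. The `TTLCache` class, the `get_or_render` method, and the `render` callback are hypothetical names for illustration; no forge is claimed to implement it this way.

```python
import time


class TTLCache:
    """Cache rendered pages for a fixed number of seconds."""

    def __init__(self, ttl_seconds, now=time.monotonic):
        self.ttl = ttl_seconds
        self.now = now          # injectable clock, handy for testing
        self.entries = {}       # key -> (expires_at, value)

    def get_or_render(self, key, render):
        """Return the cached value for `key`, calling `render()` only on a miss."""
        current = self.now()
        entry = self.entries.get(key)
        if entry is not None and entry[0] > current:
            return entry[1]     # still fresh: no rendering cost
        value = render()
        self.entries[key] = (current + self.ttl, value)
        return value
```

With even a short TTL, repeated requests for a hot page such as a repository root cost a dictionary lookup rather than a full render.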


nijave 47 days ago

Agree. Did some basic searching and it looks like GitLab is particularly bad. It ships with built-in rate limiting, but the backend marks all pages as uncacheable on top of them being somewhat dynamically generated (I guess it caches "page fragments").

The only issues I found amounted to "here's how to use Anubis to block everything"

There are also some new but poorly supported standards around agents setting `Accept: text/markdown` and https://github.com/cloudflare/web-bot-auth
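For the `Accept: text/markdown` idea, a server could check whether the client ranks Markdown above HTML and serve a cheaper representation if so. The sketch below is an assumption about how that check might look: `prefers_markdown` is a hypothetical helper, and it deliberately ignores wildcards such as `*/*`, so it is not a complete implementation of HTTP content negotiation.

```python
def prefers_markdown(accept_header):
    """Return True if text/markdown has a higher q-value than text/html."""
    weights = {}
    for part in accept_header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # default weight when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not acceptable"
        weights[media_type] = q
    return weights.get("text/markdown", 0.0) > weights.get("text/html", 0.0)
```

A handler could call `prefers_markdown(request_headers.get("Accept", ""))` and fall back to the normal HTML page whenever it returns False.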
