by PittleyDunkin 599 days ago
How do you differentiate between "ai" (whatever that means) and other crawlers?
2 comments

You don't. Theoretically, they would respect the user agent, but who can trust that anymore?
And it is a fine pickle we jarred ourselves into. We thought it would be sweet but it just came out dill.
More than user-agent, because user-agent cannot be trusted.
Great! Well then... how?
HAProxy Edge is their product and, as with Cloudflare and other competitors, the heuristics used to stifle bad actors are likely the secret sauce. Disclosing them would only hand bad actors the advantage in their game of cat and mouse.
Their user-agent.
Ah, so this is just marketing.
(disclaimer: I wrote that post)

It is not. We rely on more than User Agents because they are too often faked, so it is not just marketing. There are other signals we see that confirm whether the request came from a "legitimate" AI scraper, or a different scraper with the same user agent.

> There are other signals we see that confirm whether the request came from a "legitimate" AI scraper, or a different scraper with the same user agent.

Great! What are these signals? That seems to be the meat of the post but it's conspicuously absent. How are we supposed to validate the post?

> how are we supposed to validate the post?

Imagine you were a vendor who was trying to trick the author into divulging his methods. Can a stranger on the Internet be trusted?

I imagine if that information is disclosed, you won’t be able to verify it, as it will be bypassed… because it was disclosed.
What a wonderful world we live in where serious people are expected to believe press releases based purely on brand prestige.
So user-agent, plus whois to see if it's coming from a plausible netblock, plus Accept: strings and other HTTP header stuff?
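
For what it's worth, that guess can be sketched roughly as below. This is not HAProxy's method (it hasn't been disclosed), just one reading of "user-agent plus netblock plus headers". The `classify` function, the bot names, and the `CLAIMED_NETBLOCKS` table are all made up for illustration, and the ranges are RFC 5737 documentation addresses, not real crawler ranges. In practice you'd fill the table from whatever ranges a crawler operator publishes, or from whois data.

```python
import ipaddress

# Hypothetical table: crawler token -> netblocks it is assumed to crawl from.
# These are RFC 5737 documentation ranges, NOT real crawler ranges.
CLAIMED_NETBLOCKS = {
    "examplebot": [ipaddress.ip_network("192.0.2.0/24")],
    "otherbot": [ipaddress.ip_network("198.51.100.0/24")],
}


def classify(user_agent, remote_ip, headers):
    """Return 'plausible', 'spoofed', 'suspicious', or 'unknown' for one request."""
    ua = user_agent.lower()

    # Step 1: does the user-agent claim to be a crawler we know about?
    token = next((t for t in CLAIMED_NETBLOCKS if t in ua), None)
    if token is None:
        return "unknown"

    # Step 2: is the source address inside a netblock that crawler plausibly uses?
    addr = ipaddress.ip_address(remote_ip)
    if not any(addr in net for net in CLAIMED_NETBLOCKS[token]):
        return "spoofed"

    # Step 3: a weak header check (assumption: a real client sends Accept).
    header_names = {name.lower() for name in headers}
    if "accept" not in header_names:
        return "suspicious"

    return "plausible"


if __name__ == "__main__":
    print(classify("Mozilla/5.0 (compatible; ExampleBot/1.0)", "192.0.2.10", {"Accept": "*/*"}))
    print(classify("ExampleBot/1.0", "203.0.113.5", {"Accept": "*/*"}))
```

All of these signals can be faked except the source address, so the netblock check does most of the work here; the header check is only a tiebreaker.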