
	by PittleyDunkin 599 days ago
	How do you differentiate between "ai" (whatever that means) and other crawlers?

2 comments

yazzku 599 days ago

You don't. Theoretically, they would respect the user agent, but who can trust that anymore?

Sparkyte 599 days ago

And it is a fine pickle we jarred ourselves into. We thought it would be sweet but it just came out dill.

jcat123 599 days ago

More than user-agent, because user-agent cannot be trusted.

PittleyDunkin 599 days ago

Great! Well then... how?

prophesi 599 days ago

HAProxy Edge is their product, and as with Cloudflare and other competitors, the heuristics for stifling bad actors are likely the secret sauce. Disclosing them would only hand bad actors the advantage in their game of cat and mouse.

superkuh 599 days ago

Their user-agent.

PittleyDunkin 599 days ago

Ah, so this is just marketing.

jcat123 599 days ago

(disclaimer: I wrote that post)

It is not. We rely on more than User Agents because they are too often faked, so it is not just marketing. There are other signals we see that confirm whether the request came from a "legitimate" AI scraper, or a different scraper with the same user agent.

PittleyDunkin 599 days ago

> There are other signals we see that confirm whether the request came from a "legitimate" AI scraper, or a different scraper with the same user agent.

Great! What are these signals? That seems to be the meat of the post but it's conspicuously absent. How are we supposed to validate the post?

signatoremo 599 days ago

> how are we supposed to validate the post?

Imagine you were a vendor who was trying to trick the author into divulging his methods. Can a stranger on the Internet be trusted?

dgfitz 599 days ago

I imagine if that information is disclosed, you won’t be able to verify it, as it will be bypassed… because it was disclosed.

PittleyDunkin 599 days ago

What a wonderful world we live in where serious people are expected to believe press releases based purely on brand prestige.

superkuh 599 days ago

So user-agent, plus whois to see if it's coming from a plausible netblock, plus Accept: strings and other HTTP header stuff?
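
For illustration only, here is a minimal sketch of what that kind of guess could look like. It is not the vendor's method, which was not disclosed in this thread. The bot names, the `CLAIMED_NETBLOCKS` table, and the `classify` helper are all hypothetical, the addresses are reserved documentation ranges rather than real crawler netblocks, and the Accept-header check is just an assumed extra signal.

```python
import ipaddress

# Hypothetical table: user-agent token -> netblocks the operator is assumed
# to publish. These are reserved documentation ranges, not real crawler IPs.
CLAIMED_NETBLOCKS = {
    "ExampleAIBot": [ipaddress.ip_network("192.0.2.0/24")],
    "OtherCrawler": [ipaddress.ip_network("198.51.100.0/24")],
}


def classify(user_agent: str, remote_ip: str, headers: dict) -> str:
    """Return 'verified', 'spoofed', or 'unknown' for a single request."""
    addr = ipaddress.ip_address(remote_ip)
    header_names = {name.lower() for name in headers}
    for token, networks in CLAIMED_NETBLOCKS.items():
        if token.lower() in user_agent.lower():
            # Signal 1: does the source address sit in a netblock that
            # plausibly belongs to whoever the user agent claims to be?
            in_block = any(addr in net for net in networks)
            # Signal 2 (assumed heuristic): is an Accept header present?
            has_accept = "accept" in header_names
            return "verified" if in_block and has_accept else "spoofed"
    # The user agent does not claim to be any crawler we know about.
    return "unknown"
```

A request claiming to be `ExampleAIBot` from `192.0.2.10` with an Accept header would come back as `verified`, while the same user agent from an address outside the listed block would come back as `spoofed`. Anything this simple is easy to evade once it is public, which is the point several replies above make.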
