by marcusb 429 days ago
I've seen a bunch of requests with forged ChatGPT-related user agent headers. (At least, I believe many are forged: I don't think OpenAI uses Chinese residential IPs or Tencent Cloud for its crawling.)

Some of the LLM bots will switch to user agent headers that match real browsers if blocked outright.
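
For anyone who wants to separate forged from genuine crawler claims, here is a rough sketch of the check: if the user agent claims to be an OpenAI crawler, compare the source IP against a list of CIDR ranges you trust. The user agent tokens in the regex and the ranges below are assumptions for illustration (the ranges are RFC 5737 documentation addresses, not real crawler ranges), and `classifyRequest` is a hypothetical helper name. It is IPv4 only, and it does nothing about bots that drop the crawler user agent and pose as a regular browser.

```typescript
// Convert a dotted-quad IPv4 address to an unsigned 32-bit integer.
function ipv4ToInt(ip: string): number {
  const parts = ip.split(".");
  if (parts.length !== 4 || parts.some((p) => !/^\d{1,3}$/.test(p) || Number(p) > 255)) {
    throw new Error(`not an IPv4 address: ${ip}`);
  }
  const [a, b, c, d] = parts.map(Number);
  return ((a << 24) | (b << 16) | (c << 8) | d) >>> 0;
}

// True if `ip` falls inside the IPv4 CIDR block `cidr` (e.g. "192.0.2.0/24").
function inCidr(ip: string, cidr: string): boolean {
  const [base, bitsText] = cidr.split("/");
  const bits = Number(bitsText);
  if (!Number.isInteger(bits) || bits < 0 || bits > 32) {
    throw new Error(`bad prefix length in ${cidr}`);
  }
  // A shift by 32 is a no-op in JavaScript, so handle /0 explicitly.
  const mask = bits === 0 ? 0 : (0xffffffff << (32 - bits)) >>> 0;
  return ((ipv4ToInt(ip) & mask) >>> 0) === ((ipv4ToInt(base) & mask) >>> 0);
}

// ASSUMPTION: tokens that a request claiming to be an OpenAI crawler would carry.
const CLAIMED_CRAWLER = /GPTBot|ChatGPT-User|OAI-SearchBot/i;

// ASSUMPTION: placeholder ranges (RFC 5737 documentation blocks).
// Replace with whatever ranges the crawler operator actually publishes.
const PLACEHOLDER_RANGES = ["192.0.2.0/24", "198.51.100.0/28"];

type Verdict = "not-claimed" | "verified" | "forged";

// Classify one request by its user agent header and source IP.
function classifyRequest(
  userAgent: string,
  remoteIp: string,
  ranges: string[] = PLACEHOLDER_RANGES,
): Verdict {
  if (!CLAIMED_CRAWLER.test(userAgent)) return "not-claimed";
  return ranges.some((r) => inCidr(remoteIp, r)) ? "verified" : "forged";
}
```

You would call `classifyRequest(req.headers["user-agent"], clientIp)` from whatever middleware you already have and rate-limit or drop the "forged" bucket.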


I checked the IPs on those; they belonged to MSFT.
Does it suffice to load the content with JS or WASM to keep them out, or are they using some sort of emulated/headless browser?

If they're running JS or WASM, can the JS run a few calls likely to break (e.g., something in the WebGPU API set, since they likely aren't paying for GPUs in their scraping farm)?
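
To make the question concrete, this is roughly the probe I have in mind: ask the WebGPU API for an adapter and report what comes back. It is an untested sketch, not something known to catch real scrapers. `navigator.gpu.requestAdapter()` is the standard WebGPU entry point; `probeWebGpu`, the result labels, and the `NavigatorLike` type are names made up here. Plenty of legitimate browsers have no WebGPU, and a headless browser may expose a software adapter, so the result is at best one signal among several.

```typescript
// Minimal structural types for the part of the WebGPU API used here,
// so the function can be exercised outside a browser as well.
interface GpuLike {
  requestAdapter(): Promise<object | null>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

type ProbeResult = "no-webgpu-api" | "no-adapter" | "adapter-available" | "probe-threw";

// Ask for a GPU adapter and classify the outcome.
async function probeWebGpu(nav: NavigatorLike): Promise<ProbeResult> {
  // Browsers without WebGPU support do not define navigator.gpu at all.
  if (!nav.gpu) return "no-webgpu-api";
  try {
    // requestAdapter() resolves to null when no suitable adapter exists.
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "adapter-available" : "no-adapter";
  } catch {
    // An environment that stubs the API badly may throw or reject instead.
    return "probe-threw";
  }
}
```

In a page you would call `probeWebGpu(navigator as NavigatorLike)` and send the result back with the content request, then compare the distribution between known-good visitors and suspected bots before deciding to gate anything on it.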

I haven't tested that behavior, sorry.
No worries. I'll get around to it. I was just curious if you might've explored a bit. Thank you.