by GodelNumbering 429 days ago
Wow, this is interesting. I launched my site about a week ago and only submitted it to Google, but all the crawlers mentioned in the article (especially the SEO bots) were heavily crawling it within a few days.

Interestingly, OpenAI's crawler visited over 1,000 times, many of them as "ChatGPT-User/1.0", which is supposed to be used when a user searches in ChatGPT. Not a single referred visitor, though. Makes me wonder whether it's at all beneficial for content publishers to allow bot crawls.

I ended up banning every SEO bot in robots.txt, along with a bunch of other bots.
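
For anyone who wants to sanity-check a robots.txt like that before deploying it, here's a minimal sketch using Python's standard `urllib.robotparser`. The bot names (AhrefsBot, SemrushBot, MJ12bot), the `is_allowed` helper, and the example.com URL are placeholders for illustration, not the actual list from my site. Keep in mind robots.txt only affects crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block a few SEO crawlers, allow everyone else.
# The bot names are illustrative assumptions, not a complete list.
ROBOTS_TXT = """\
User-agent: AhrefsBot
User-agent: SemrushBot
User-agent: MJ12bot
Disallow: /

User-agent: *
Allow: /
"""

def is_allowed(user_agent: str, url: str = "https://example.com/") -> bool:
    """Return True if the policy above lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    for bot in ("AhrefsBot", "SemrushBot", "Googlebot"):
        print(bot, "allowed" if is_allowed(bot) else "blocked")
```

Note that the standard-library parser matches on the product token before the first slash, so this checks bare bot names rather than full browser-style user agent strings.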


I've seen a bunch of requests with forged ChatGPT-related user agent headers (at least, I believe many are forged - I don't think OpenAI uses Chinese residential IPs or Tencent cloud for their data crawling activities.)

Some of the LLM bots will switch to user agent headers that match real browsers if blocked outright.
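
A rough sketch of how I'd flag the forged ones, assuming the bot operator publishes the IP ranges it crawls from: compare the source IP against those ranges whenever the user agent claims to be the bot. The CIDR blocks below are documentation placeholders (RFC 5737), not real crawler ranges, and the function names are made up for this example.

```python
import ipaddress

# Placeholder CIDR blocks (RFC 5737 documentation ranges), NOT real crawler
# ranges. Swap in whatever list the bot operator actually publishes.
PUBLISHED_RANGES = ["192.0.2.0/24", "198.51.100.0/24"]

def is_from_published_range(ip: str, ranges=PUBLISHED_RANGES) -> bool:
    """Return True if `ip` falls inside any of the given CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in ranges)

def looks_forged(user_agent: str, ip: str) -> bool:
    """Flag requests that claim the bot's user agent from an unlisted IP."""
    claims_bot = "chatgpt-user" in user_agent.lower()
    return claims_bot and not is_from_published_range(ip)

if __name__ == "__main__":
    ua = "Mozilla/5.0 (compatible; ChatGPT-User/1.0)"
    print(looks_forged(ua, "203.0.113.7"))
    print(looks_forged(ua, "192.0.2.10"))
```

This only catches forgeries of a bot's own user agent. It does nothing about bots that switch to a regular browser user agent after being blocked.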

I checked the IPs on those; they belonged to MSFT.

Does it suffice to load the content with JS or WASM to keep them out, or are they using some sort of emulated/headless browser?

If they're running JS or WASM, can the JS run a few calls likely to break (e.g., something in the WebGPU API set, since they likely aren't paying for GPUs in their scraping farm)?

I haven't tested that behavior, sorry.

No worries. I'll get around to it. I was just curious whether you might've explored it a bit. Thank you.