HN Mirror


by GodelNumbering 429 days ago

Wow, this is interesting. I launched my site about a week ago and only submitted it to Google, but all the crawlers mentioned in the article (especially the SEO bots) were heavily crawling it within a few days.

Interestingly, OpenAI's crawler visited over 1,000 times, many of those as "ChatGPT-User/1.0", which is supposed to be used when a user searches in ChatGPT. Not a single referred visitor, though. Makes me wonder whether allowing bot crawls is of any benefit to content publishers.

I ended up banning every SEO bot in robots.txt, along with a bunch of other bots.
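
For anyone wanting to sanity-check a robots.txt ban like this, here is a minimal sketch using Python's standard `urllib.robotparser`. The bot names and the `is_allowed` helper are illustrative assumptions, not the actual list from the comment above. Keep in mind that robots.txt is only advisory, so this confirms what the file says, not what a bot will actually do.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. The bot names are illustrative examples only,
# not the list the commenter actually banned.
ROBOTS_TXT = """\
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str = "https://example.com/") -> bool:
    """Return True if the rules in ROBOTS_TXT permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for bot in ("AhrefsBot", "SemrushBot", "Googlebot"):
        print(bot, "allowed" if is_allowed(bot) else "blocked")
```

Running the same check against the real file before deploying it is a cheap way to catch a typo that would otherwise block everything, or nothing.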

1 comments

marcusb 429 days ago

I've seen a bunch of requests with forged ChatGPT-related user agent headers (at least, I believe many are forged - I don't think OpenAI uses Chinese residential IPs or Tencent cloud for their data crawling activities.)

Some of the LLM bots will switch to user agent headers that match real browsers if blocked outright.
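
One way to separate forged headers from real ones is to compare the client IP against the ranges the crawler operator publishes. A rough sketch, assuming you have already fetched those ranges: the CIDR blocks below are documentation placeholders (RFC 5737), and `claims_chatgpt`, `ip_in_ranges`, and `looks_forged` are hypothetical helper names.

```python
import ipaddress

# Placeholder ranges from the RFC 5737 documentation blocks. These are NOT
# real crawler ranges; substitute whatever the operator actually publishes.
PUBLISHED_RANGES = ["192.0.2.0/24", "198.51.100.0/24"]


def claims_chatgpt(user_agent: str) -> bool:
    """True if the user-agent string presents itself as a ChatGPT-related client."""
    ua = user_agent.lower()
    return "chatgpt" in ua or "gptbot" in ua


def ip_in_ranges(ip: str, ranges=PUBLISHED_RANGES) -> bool:
    """True if ip falls inside any of the given CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in ranges)


def looks_forged(user_agent: str, ip: str) -> bool:
    """Flag requests that claim to be ChatGPT but come from an unlisted IP."""
    return claims_chatgpt(user_agent) and not ip_in_ranges(ip)


if __name__ == "__main__":
    ua = "Mozilla/5.0 (compatible; ChatGPT-User/1.0)"
    print(looks_forged(ua, "203.0.113.7"))  # IP outside the placeholder ranges
    print(looks_forged(ua, "192.0.2.10"))   # IP inside the placeholder ranges
```

This only catches forged headers. It does nothing about bots that switch to a regular browser user agent after being blocked.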


GodelNumbering 429 days ago

I checked the IPs on those; they belonged to MSFT.


hansvm 429 days ago

Does it suffice to load the content with JS or WASM to keep them out, or are they using some sort of emulated/headless browser?

If they're running JS or WASM, can the JS run a few calls likely to break (e.g., something in the WebGPU API set, since they likely aren't paying for GPUs in their scraping farm)?


marcusb 429 days ago

I haven't tested that behavior, sorry.


hansvm 429 days ago

No worries. I'll get around to it. I was just curious if you might've explored a bit. Thank you.
