GPTBot ignoring robots.txt on my Cloudflare Worker
3 points by white_viel 120 days ago
TL;DR: GPTBot is systematically accessing my private Ubuntu mirror, ignoring the robots.txt

This morning I woke up to the following message from Cloudflare about my quota usage on Cloudflare Workers:

>> Your account has reached 75% of its daily requests limit for Cloudflare Workers and/or Pages Functions

This is unusual, as I only have one worker on my Cloudflare account, which proxies the apt repos for my personal PC to specific upstream services. Although the domain is public, it is not posted anywhere and is only used by my home PCs.

So I pulled the Cloudflare Worker logs and saw about 160k requests in the last 24 hours, up from barely 24 (yes, 24 in total), for various packages via my proxy.

An extract of the logs is below:

>> {
>>   "headers": {
>>     "accept": "/",
>>     "accept-encoding": "gzip, br",
>>     "cf-connecting-ip": "74.7.227.53",
>>     "cf-ipcountry": "US",
>>     "cf-ray": "9d388b074b38d3be",
>>     "cf-visitor": "{"scheme":"https"}",
>>     "connection": "Keep-Alive",
>>     "from": "gptbot(at)openai.com",
>>     "host": "XXXXXXXXXXXXXXXXX.brotich.workers.dev",
>>     "referer": "https://XXXXXXXXXXXXXXXXX.brotich.workers.dev/ubuntu/pool/universe/z/zephyr/",
>>     "user-agent": "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)",
>>     "x-forwarded-proto": "https",
>>     "x-openai-host-hash": "103003167",
>>     "x-real-ip": "74.7.227.53"
>>   }
>> }

As you can see, the requests are from GPTBot, which collects training data.

Now the annoying bit: according to OpenAI, they respect robots.txt. I have this set up on my domain as follows:

>>> # BEGIN Cloudflare Managed content
>>>
>>> User-agent: *
>>> Content-Signal: search=yes,ai-train=no
>>> Allow: /
>>>
>>> User-agent: Amazonbot
>>> Disallow: /
>>>
>>> User-agent: Applebot-Extended
>>> Disallow: /
>>>
>>> User-agent: Bytespider
>>> Disallow: /
>>>
>>> User-agent: CCBot
>>> Disallow: /
>>>
>>> User-agent: ClaudeBot
>>> Disallow: /
>>>
>>> User-agent: Google-Extended
>>> Disallow: /
>>>
>>> User-agent: GPTBot
>>> Disallow: /
>>>
>>> User-agent: meta-externalagent
>>> Disallow: /
>>>
>>> # END Cloudflare Managed Content
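To double-check that these rules really do exclude GPTBot, here is a rough TypeScript sketch of a robots.txt check. It is deliberately simplified and not a full implementation of the robots exclusion rules: it only reads `User-agent` and `Disallow` lines, does plain prefix matching, and ignores `Allow` precedence and wildcards. The names `parseRobots`, `isDisallowed` and `postedRules` are made up for this example.

```typescript
// One robots.txt group: the agents it names and the path prefixes it disallows.
type Group = { agents: string[]; disallow: string[] };

function parseRobots(text: string): Group[] {
  const groups: Group[] = [];
  let current: Group | null = null;
  let lastWasAgent = false;
  for (const raw of text.split("\n")) {
    const line = raw.split("#")[0].trim(); // drop comments
    const idx = line.indexOf(":");
    if (idx < 0) continue; // blank or malformed line
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (field === "user-agent") {
      // Consecutive User-agent lines share one group; otherwise start a new one.
      if (!current || !lastWasAgent) {
        current = { agents: [], disallow: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else {
      if (field === "disallow" && current && value !== "") {
        current.disallow.push(value);
      }
      lastWasAgent = false;
    }
  }
  return groups;
}

function isDisallowed(groups: Group[], token: string, path: string): boolean {
  const name = token.toLowerCase();
  // A group that names the crawler takes priority over the "*" group.
  const specific = groups.filter((g) => g.agents.includes(name));
  const chosen = specific.length > 0 ? specific : groups.filter((g) => g.agents.includes("*"));
  return chosen.some((g) => g.disallow.some((prefix) => path.startsWith(prefix)));
}

// Trimmed copy of the rules posted above.
const postedRules = [
  "User-agent: *",
  "Content-Signal: search=yes,ai-train=no",
  "Allow: /",
  "",
  "User-agent: GPTBot",
  "Disallow: /",
].join("\n");

console.log(isDisallowed(parseRobots(postedRules), "GPTBot", "/ubuntu/pool/universe/z/zephyr/")); // true
```

Under those simplifications, the posted rules do disallow every path for GPTBot.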

This is just a hobby project, and I have put safeguards on Cloudflare to prevent scraping by bots. There is nothing of value in there; it's just a proxy for my own use.

Why say you respect robots.txt if you don't?

6 comments

The workers.dev bypass is a known gap — Cloudflare's zone-level WAF doesn't apply to the workers.dev subdomain by default, so anything you've built in front of your real domain is irrelevant once a bot figures out the direct route. You already hit this. The fact it adapted after 403s suggests it's not just a passive crawler either, it's doing something closer to active probing.

UA-based blocking will always be a game of whack-a-mole for this. Cloudflare's bot score (exposed to rule expressions as `cf.bot_management.score` and, inside a Worker, as `request.cf.botManagement.score`, provided Bot Management is enabled on the account) is a lot more durable: you can rate-limit or challenge anything under 30 without caring what UA they claim. Pair that with a Turnstile challenge on any endpoint you actually need to protect, and you shrink the attack surface instead of blocking individual bots.
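A minimal TypeScript sketch of that score check, under a few assumptions: the score is read from `request.cf.botManagement.score` inside the Worker, it may be missing when Bot Management is not enabled, and a missing score is allowed through. `CfRequestLike`, `actionFor` and `CHALLENGE_BELOW` are names invented for this example, not part of any Cloudflare API.

```typescript
// Shape of the only field this sketch reads; in a Worker the real object is request.cf.
interface CfRequestLike {
  cf?: { botManagement?: { score?: number } };
}

type BotAction = "allow" | "challenge";

// 30 is the figure suggested above; tune it for your own traffic.
const CHALLENGE_BELOW = 30;

function actionFor(request: CfRequestLike, challengeBelow: number = CHALLENGE_BELOW): BotAction {
  const score = request.cf?.botManagement?.score;
  // Assumption: fail open when no score is attached to the request.
  if (score === undefined) return "allow";
  return score < challengeBelow ? "challenge" : "allow";
}

console.log(actionFor({ cf: { botManagement: { score: 5 } } }));  // challenge
console.log(actionFor({ cf: { botManagement: { score: 80 } } })); // allow
console.log(actionFor({}));                                        // allow
```

Failing open keeps legitimate clients working if the score is unavailable; failing closed would be the stricter choice.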

Longer term, the workers.dev exposure is worth auditing across all your workers, not just this one. What does your other worker surface look like — are these all behind your main domain or do other workers have the same split-zone problem?

Hey! Saw your post about GPTBot eating up your Cloudflare Workers quota. The brutal truth is that relying on robots.txt in 2026 is like putting a 'please do not enter' sticky note on a bank vault. AI scrapers are notoriously ignoring them or rotating IPs. You are literally paying Cloudflare so OpenAI can train their models. I build custom infrastructure defenses for SaaS founders. Instead of hoping they respect your robots.txt, I can set up specific Cloudflare WAF (Web Application Firewall) rules and Edge Workers that fingerprint AI scrapers (even when they spoof user-agents) and drop the connection at the edge before it hits your billing quota. If you want to permanently lock them out and protect your server bill, let's chat. I can share the JSON ruleset for your Cloudflare dashboard.
Finally managed to kill the traffic:
1. Renamed the worker. The bot was bypassing the route on the domain and using the workers.dev domain.
2. Added WAF rules to block "gptbot" in the user agent.
3. Served a zip bomb on requests from GPTBot.

Interesting that GPTBot was accessing the worker directly to bypass the WAF rules on Cloudflare.
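For anyone wanting the user-agent block to cover the workers.dev hostname as well, the check can live in the Worker itself, since Worker code runs on whichever hostname the request arrives at. A rough TypeScript sketch, where `BLOCKED_AGENT_TOKENS`, `isBlockedAgent` and `statusFor` are made-up names and the deny-list holds only the one bot mentioned in this thread:

```typescript
// Hypothetical deny-list for this sketch; the thread only mentions GPTBot.
const BLOCKED_AGENT_TOKENS: string[] = ["gptbot"];

// userAgent is string | null to mirror request.headers.get("user-agent").
function isBlockedAgent(userAgent: string | null): boolean {
  if (!userAgent) return false;
  const ua = userAgent.toLowerCase();
  return BLOCKED_AGENT_TOKENS.some((token) => ua.includes(token));
}

// Status the Worker would answer with before doing any proxying.
function statusFor(userAgent: string | null): number {
  return isBlockedAgent(userAgent) ? 403 : 200;
}

// User agent copied from the log extract in the post.
const gptbotUa =
  "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)";

console.log(statusFor(gptbotUa));                       // 403
console.log(statusFor("Debian APT-HTTP/1.3 (2.4.11)")); // 200 (example apt-style user agent)
```

This only helps against bots that announce themselves honestly, and a request rejected inside the Worker has still invoked the Worker.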

Interesting. workers.dev domains can be a liability sometimes -- if you've mapped the worker to a real zone, then you probably don't want the workers.dev zone anymore.

For what it's worth, you can disable the workers.dev zone by putting `"workers_dev": false,` in wrangler.jsonc. You can also enable Cloudflare Access on your workers.dev zone to require login (there's a switch for this in the Cloudflare dashboard UI for the worker).
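As an illustration, a wrangler.jsonc with that setting might look roughly like the fragment below. The worker name, entry point, compatibility date and hostname are placeholders invented for this sketch, not taken from the original post.

```jsonc
{
  // Placeholder name and entry point
  "name": "apt-proxy",
  "main": "src/index.ts",
  "compatibility_date": "2024-01-01",
  // Stop publishing the worker on its *.workers.dev hostname
  "workers_dev": false,
  // Keep serving it from a hostname on your own zone (placeholder domain)
  "routes": [{ "pattern": "apt.example.com", "custom_domain": true }]
}
```

With the workers.dev hostname off, requests can only reach the worker through the zone, where the zone's WAF rules apply.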

But of course you have to remember to do those things... I wonder if we (Cloudflare) should be more proactive in suggesting disabling/locking down the workers.dev zone once a worker is mapped to another zone...

>> interesting. workers.dev domains can be a liability sometimes -- if you've mapped the worker to a real zone, then you probably don't want the workers.dev zone anymore.

Maybe that is a good idea: prompt the user to disable the workers.dev domain. However, I keep the workers.dev domain open as a backup way to access the worker, as the mapped domain is more for a hobby project.

> Interesting. workers.dev domains can be a liability sometimes

What about allowing the specific workers.dev hostname to serve its own robots.txt that can be used to turn away the AI bots?

i.e. https://XXXXXXXXXXX.username.workers.dev/robots.txt should be configurable at the worker level. Not sure how that affects the design of the infra, but it would be a good idea.
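In the meantime, a Worker can answer `/robots.txt` itself on whichever hostname it is reached through. A minimal TypeScript sketch, assuming the proxy logic is a separate step that is not shown here; `ROBOTS_TXT`, `robotsBodyFor` and `worker` are names invented for this example, and in a real Worker the `worker` object would be the module's default export.

```typescript
// robots.txt body served by the Worker itself (shortened to one blocked bot).
const ROBOTS_TXT = [
  "User-agent: GPTBot",
  "Disallow: /",
  "",
  "User-agent: *",
  "Allow: /",
  "",
].join("\n");

// Returns the robots.txt body for the robots path, or null so the caller
// can fall through to its normal proxy logic.
function robotsBodyFor(pathname: string): string | null {
  return pathname === "/robots.txt" ? ROBOTS_TXT : null;
}

const worker = {
  async fetch(request: Request): Promise<Response> {
    const body = robotsBodyFor(new URL(request.url).pathname);
    if (body !== null) {
      return new Response(body, {
        headers: { "content-type": "text/plain; charset=utf-8" },
      });
    }
    // Placeholder for the existing proxy logic, which is not shown in the thread.
    return new Response("proxy logic goes here", { status: 501 });
  },
};
```

This only publishes the rules on the workers.dev hostname; whether a crawler honours them is a separate question.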

But I am impressed that GPTBot was able to figure out the workers.dev URL on its own and access it.
Update: the bot is back with a vengeance, sending about 1 request per second and ignoring both robots.txt and the 403 status code.
I started serving a zip bomb, and after 10 minutes the traffic from GPTBot disappeared.
I will keep serving a zip bomb to the bot to see if it stays away from my proxy.