| TLDR: GPTBot is systematically accessing my private Ubuntu mirror, ignoring robots.txt.

This morning I woke up to the following message from Cloudflare about my quota usage on Cloudflare Workers:

>> Your account has reached 75% of its daily requests limit for Cloudflare Workers and/or Pages Functions

This is unusual, as I only have one worker on my Cloudflare account, which proxies the apt repos for my personal PC to specific upstream services. Although the domain is public, it is not posted anywhere and is only used by my home PCs.

So I pulled the Cloudflare worker logs and saw about 160k requests in the last 24 hours, up from barely 24 (yes, 24 in total), for various packages via my proxy. An extract of the logs is below:

>> {
>> "headers": {
>> "accept": "*/*",
>> "accept-encoding": "gzip, br",
>> "cf-connecting-ip": "74.7.227.53",
>> "cf-ipcountry": "US",
>> "cf-ray": "9d388b074b38d3be",
>> "cf-visitor": "{\"scheme\":\"https\"}",
>> "connection": "Keep-Alive",
>> "from": "gptbot(at)openai.com",
>> "host": "XXXXXXXXXXXXXXXXX.brotich.workers.dev",
>> "referer": "https://XXXXXXXXXXXXXXXXX.brotich.workers.dev/ubuntu/pool/universe/z/zephyr/",
>> "user-agent": "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)",
>> "x-forwarded-proto": "https",
>> "x-openai-host-hash": "103003167",
>> "x-real-ip": "74.7.227.53"
>> }
>> }

As you can see, the request is from GPTBot, which collects training data. Now the annoying bit:
- According to OpenAI, they respect robots.txt. I have this set up on my domain as follows:

>>> # BEGIN Cloudflare Managed content
>>>
>>> User-agent: *
>>> Content-Signal: search=yes,ai-train=no
>>> Allow: /
>>>
>>> User-agent: Amazonbot
>>> Disallow: /
>>>
>>> User-agent: Applebot-Extended
>>> Disallow: /
>>>
>>> User-agent: Bytespider
>>> Disallow: /
>>>
>>> User-agent: CCBot
>>> Disallow: /
>>>
>>> User-agent: ClaudeBot
>>> Disallow: /
>>>
>>> User-agent: Google-Extended
>>> Disallow: /
>>>
>>> User-agent: GPTBot
>>> Disallow: /
>>>
>>> User-agent: meta-externalagent
>>> Disallow: /
>>>
>>> # END Cloudflare Managed Content

This is just a hobby project, and I have put safeguards on Cloudflare to prevent scraping by bots. There is nothing of value in there; it's just a proxy for my own use. Why say you respect robots.txt if you don't? |
UA-based blocking will always be a game of whack-a-mole for this. Cloudflare's bot score (exposed to Workers as `request.cf.botManagement.score`, and to rule expressions as `cf.bot_management.score`) is a lot more durable: you can rate-limit or challenge anything under 30 without caring what UA they claim. Pair that with a Turnstile challenge on any endpoint you actually need to protect, and you remove the attack surface entirely rather than blocking individual bots.
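To make that concrete, here is a rough sketch of what a score check inside the worker could look like. Treat it as an illustration, not a drop-in: the `decide` and `handleRequest` names and the `CfProps` shape are made up for this example, the threshold of 30 is just the number mentioned above, and the score is only populated if Bot Management is enabled on the account. A Worker cannot issue a managed challenge by itself, so this sketch swaps "challenge" for a plain 403.

```typescript
// Threshold from the comment above ("anything under 30"); tune to taste.
const BOT_SCORE_THRESHOLD = 30;

// Minimal, assumed shape of the parts of `request.cf` this sketch reads.
// The score may be missing when Bot Management is not enabled.
interface CfProps {
  botManagement?: { score?: number };
}

type Action = "allow" | "block";

// Decide based on the bot score alone, ignoring the claimed User-Agent.
function decide(cf: CfProps | undefined): Action {
  const score = cf?.botManagement?.score;
  // No score available: allow, so the worker behaves as it does today
  // instead of locking out your own apt clients.
  if (score === undefined) return "allow";
  return score < BOT_SCORE_THRESHOLD ? "block" : "allow";
}

// Hypothetical fetch handler wrapping the existing proxy logic.
async function handleRequest(request: Request): Promise<Response> {
  // `cf` is added by the Workers runtime and is not on the standard Request type.
  const cf = (request as Request & { cf?: CfProps }).cf;
  if (decide(cf) === "block") {
    return new Response("Forbidden", { status: 403 });
  }
  // Placeholder: your current upstream proxying would go here.
  return fetch(request);
}

// In a Worker module this would be wired up as:
//   export default { fetch: handleRequest };
```

Keeping the decision in a small pure function makes it easy to check the threshold logic locally before deploying, and to swap the 403 for a rate limit later.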
Longer term, the workers.dev exposure is worth auditing across all your workers, not just this one. What does your other worker surface look like? Are they all behind your main domain, or do other workers have the same split-zone problem?