| We feel this at work too. We run a book streaming platform with all books, booklists, authors, narrators and publishers available as standalone web pages for SEO, numbering in the multiple millions. The last 6 months have turned into a hellscape, for a few reasons: 1. It's become commonplace to not respect rate limits. 2. Bots no longer identify themselves by UA. 3. Bots use VPNs or similar tech to bypass IP rate limiting. 4. Bots use tools like NobleTLS or JA3Cloak to get around JA3 rate limiting. 5. Some valid LLM companies seem to also follow the above to gather training data. We want them to know about our company, so we don't necessarily want to block them. I'm close to giving up on this front tbh. There are no longer safe methods of identifying malignant traffic at scale, and with the variations we have available we can't statically generate these pages. Even with a CDN cache (shoutout Fastly) our catalog is simply too broad to fully saturate the cache while still allowing pages to be updated in a timely manner. I guess the solution is to just scale up the origin servers... /shrug In all seriousness, I'd love it if we could somehow tell the bots about more efficient ways of fetching the data. Use our open API for fetching book information instead of causing all that overhead by going to marketing pages, please. |
Any halfway modern LLM could probably code the backend for this in a day or two and it'd run on a RasPi. Some org just has to take charge and provide the infra and advertisement.