Hacker News
by londons_explore 693 days ago
For read-only content, I just stick it behind a cache and let the bots go wild.
3 comments

You can't cache this stuff for bot consumption. Humans only want to see the popular stuff. Bots download everything. The size of your cache then equals the size of your content database.
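To make the long-tail point concrete, here is a rough sketch (my own toy model, not anything measured on a real site): a plain LRU cache in front of a crawler that walks every page in order. The `LRUCache` class, the `crawl_hit_rate` helper, and the page counts are all made up for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache that counts hits and misses (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key, render):
        if key in self.items:
            # Mark as most recently used.
            self.items.move_to_end(key)
            self.hits += 1
            return self.items[key]
        self.misses += 1
        value = render(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            # Evict the least recently used entry.
            self.items.popitem(last=False)
        return value


def crawl_hit_rate(num_pages, capacity, passes=2):
    """Hit rate for a bot that requests every page in order, `passes` times."""
    cache = LRUCache(capacity)
    for _ in range(passes):
        for page_id in range(num_pages):
            cache.get(page_id, lambda p: f"<html>page {p}</html>")
    return cache.hits / (cache.hits + cache.misses)


small = crawl_hit_rate(num_pages=1000, capacity=100)
full = crawl_hit_rate(num_pages=1000, capacity=1000)
```

In this toy model the small cache never gets a hit, because a sequential crawl evicts every page before it comes around again; only the cache sized to the whole content set helps on the second pass.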
But you can still make sure that you save the data in a form where generating the served webpage takes the least amount of time. For most websites this means saving the HTML, either in a giant cache or with a more deliberate pre-generation setup.
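A minimal sketch of what "saving the HTML" could look like, assuming the page is rendered once when the content is written and served afterwards as ready-made bytes. `PrerenderedSite`, `save_post`, and `serve` are hypothetical names for illustration, and the in-memory dict stands in for whatever disk or key-value store you would really use.

```python
import html


class PrerenderedSite:
    """Render pages at write time so serving is a plain lookup (sketch)."""

    PAGE = (
        "<!doctype html><html><head><title>{title}</title></head>"
        "<body><h1>{title}</h1><p>{body}</p></body></html>"
    )

    def __init__(self):
        # slug -> fully rendered page as bytes, ready to write to a socket
        self.rendered = {}

    def save_post(self, slug, title, body):
        # The templating cost is paid once here, not on every request.
        page = self.PAGE.format(title=html.escape(title), body=html.escape(body))
        self.rendered[slug] = page.encode("utf-8")

    def serve(self, slug):
        # No templating and no database query on the read path.
        return self.rendered.get(slug)


site = PrerenderedSite()
site.save_post("hello", "Hello", "x < y")
page = site.serve("hello")
```

The read path does no work beyond a dictionary lookup, so a bot fetching the long tail costs about the same per request as a human fetching the front page.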
The data-structure-to-HTML conversion takes milliseconds. That's a distinction without a difference.
Clearly not the case with most websites. And "milliseconds" are already a huge amount of time. Video games simulate huge worlds and render complex 3D graphics within 16 ms, or even much less at the 60+ fps frame rates that are expected these days.
You've gotten off topic here. The giant cache you speak of approaches the size of the content database when designing for the long tail of bots. A giant cache is uneconomical and thus not a solution unless you're an AWS salesman.
Yes, the cache ends up being bigger than the content database, but for text content that's typically not a problem. The human effort to type some text always hugely exceeds the cost of a few kilobytes of flash to store what they typed in a ready-to-serve form.

The process of taking the raw text and assembling the page around it is typically rather expensive in most CMSes. Sure, it isn't theoretically expensive, but unless you want to engineer a CMS from scratch, most people just pick one off the shelf and then end up paying the CPU overhead of WordPress etc.
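For a back-of-envelope feel for the "few kilobytes of flash" claim, here is a sketch with made-up inputs. The page count, average page size, and price per GB are assumptions picked for illustration, not real figures for any site or vendor.

```python
def prerender_storage_cost(num_pages, avg_page_kb, usd_per_gb):
    """Return (size in GB, cost in USD) for storing every page pre-rendered.

    Uses decimal units: 1 KB = 1000 bytes, 1 GB = 1e9 bytes.
    """
    total_gb = num_pages * avg_page_kb * 1000 / 1e9
    return total_gb, total_gb * usd_per_gb


# Assumed inputs: one million pages, 50 KB each, $0.10 per GB of flash.
size_gb, cost_usd = prerender_storage_cost(1_000_000, 50, 0.10)
```

With those assumed numbers the whole pre-rendered site is 50 GB and costs about $5 of storage; plug in your own page count and prices to see where it stops being cheap.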

Not at all. You can either design your website so that pages can be retrieved in sub-millisecond time (and that doesn't have to mean throwing money at cloud providers) or you can cry about bots.
That’s just passing the buck.

Someone still needs to pay for that traffic. If it gets too much for Cloudflare or whoever, you’re gonna get the bill.

Traffic isn't all that expensive, though, if you're not using a cloud provider that relies on bandwidth pricing to squeeze captive customers.
I presume OSM has already considered this and ruled it out (probably because the map should be dynamic).