by btilly 1409 days ago
You can have massive amounts of RAM these days. You’re sooner to hit big-O limits from bad architectural decisions than run out of memory. If you do get to that point you likely have enough value in your usage to justify scaling out further and sharding.

Absolute disagreement.

It is very easy to accidentally leak a few hundred MB per week in a busy Redis system. The code will look and work fine...at first. It is correspondingly hard to track down and clean up the leak a few months later. (Particularly if there are multiple such leaks to track down.) Yes, you can go for years just buying larger and larger EC2 instances. But that will also come with a shocking price tag.

I know of a number of organizations that this happened to. And pretty much every bad Redis story I hear about had this as a root cause. That is why I brought it up as an important consideration.


Yes, this matches my experience.

Redis excels as a memcached alternative with some useful operations. Where people get into trouble with Redis is treating it as a persistent data store when, despite its ability to replicate and persist, Redis has some constraints you need to work within. At best, think of Redis as something that can hold a materialized view, but one that can become corrupted at any random time, so you'll need the ability to rematerialize it from something else. And second, you absolutely have to be conscious of how close you are to RAM limits.
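
To make the "materialized view you can rebuild" idea concrete, here is a minimal Python sketch. The names `get_view` and `rebuild` are hypothetical, and `cache` is assumed to be any object with `get` and `set` methods (a redis-py client has both, though it returns bytes rather than the original Python value).

```python
def get_view(cache, key, rebuild):
    """Return the cached view, rebuilding it from the source of truth on a miss.

    `cache` is any object with get(key) and set(key, value).
    `rebuild` is a hypothetical callable that recomputes the view
    from the authoritative store (for example an RDBMS query).
    """
    value = cache.get(key)
    if value is None:
        # Cache is empty, evicted, or was wiped: rematerialize.
        value = rebuild(key)
        cache.set(key, value)
    return value
```

The point of the design is that nothing lives only in the cache: if the Redis instance is lost, every entry can be recomputed on the next read.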

Redis is production-ready and it has a lot of features to help you track down problems with either memory or CPU usage. For example: `redis-cli --bigkeys` will help you find the very large keys. For smaller keys that occur too often, sampling a few hundred keys should be sufficient to help you find what type of keys are taking more space than necessary.
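
A rough sketch of that sampling approach in Python, assuming a redis-py style client that exposes `randomkey()` and `memory_usage(key)`. The function name `sample_memory_by_prefix` and the convention that key names look like `prefix:rest` are assumptions made for illustration.

```python
from collections import defaultdict

def sample_memory_by_prefix(client, samples=300, sep=":"):
    """Sample random keys and total their memory usage per key prefix.

    Assumes `client` has randomkey() and memory_usage(key), as redis-py does,
    and that keys are named like "prefix:rest" (an assumption).
    Returns (prefix, key_count, total_bytes) rows, largest total first.
    """
    totals = defaultdict(lambda: [0, 0])  # prefix -> [key count, bytes]
    seen = set()
    for _ in range(samples):
        key = client.randomkey()
        if key is None:          # empty database
            break
        if key in seen:          # a random sample can repeat a key
            continue
        seen.add(key)
        size = client.memory_usage(key) or 0  # None if the key vanished meanwhile
        name = key.decode(errors="replace") if isinstance(key, bytes) else key
        prefix = name.split(sep, 1)[0]
        totals[prefix][0] += 1
        totals[prefix][1] += size
    rows = [(p, count, size) for p, (count, size) in totals.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

With a real server you would pass something like `redis.Redis()` as `client`. The totals only describe the sample, so treat them as a hint about which key families to look at, not as exact usage.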

Once you get the Redis database designed well, there are a lot of things you can do before hitting the limit where you can't install any more RAM in a new machine. For example, there are no more than a billion .com domains out there. Say a single record takes 100 bytes on average, consisting of the domain name and a glue record pointing to the IP of its authoritative DNS server. Then it takes just 100GB of memory to store enough information to handle all queries to .com domains in the world. It's not so hard to obtain a machine with 768GB memory these days, and 2TB machines are not uncommon.
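
The back-of-the-envelope estimate above, written out in Python. It uses only the figures from the comment (one billion records, 100 bytes each, a 768GB machine) and counts raw payload only; any per-key overhead Redis adds is not included.

```python
# Figures taken from the comment above.
domains = 1_000_000_000      # upper bound on .com domains
bytes_per_record = 100       # average record size, payload only

total_bytes = domains * bytes_per_record
total_gb = total_bytes / 10**9          # decimal gigabytes
share_of_768gb = total_gb / 768         # fraction of a 768GB machine

print(f"{total_gb:.0f} GB, {share_of_768gb:.0%} of a 768GB machine")
```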

And if you worry about the price tag, don't use EC2. You can rent a 1TB RAM dedicated server at https://www.hetzner.com/dedicated-rootserver/ax161/configura... for $600 per month. At Scaleway you can rent it for $1000 per month: https://www.scaleway.com/en/pricing/?tags=baremetal,availabl.... AWS is notoriously hard to make cost-effective.

You can also "leak" rows in a traditional RDBMS or even a filesystem. Why is this particularly notable for Redis?

Redis starts to have issues at high scale, even on sophisticated hardware, that can be quite difficult to debug without a lot of additional effort and storage. It’s not just memory, but odd behavior (e.g. randomly dropped connections) with a lot of connected clients, or hot keys/nodes in a cluster configuration, etc.

These issues can exist in any system, but in my experience it’s especially tough (relatively) to identify and diagnose them with Redis. Once you add Lua script usage it can get even worse.

A traditional RDBMS or filesystem is designed for high throughput and concurrency, even if some tasks are blocked on data. Additionally, both have options to partition steadily growing things, if needed with old partitions being moved to tape backup while the server continues running.

Redis is a single-threaded program acting against RAM whose philosophy is that it does things fast and then moves on to the next job. If it needs to access memory that got paged to disk, the whole server stops and waits to get it. Nobody can do anything.

Because Redis doesn't have to deal with locking and concurrency, it can run much faster on the same resources. But when concurrency is required, it is stuck because it doesn't have it.