Hacker News new | ask | show | jobs
by ZGF4 3268 days ago
There are a few misguided views in this article and in some of these comments.

1. Every shardable database (Cassandra, Dynamo, BigTable) has to worry about hot spots. Picking a UUID as a partition key is only step one. What happens if one user is a huge majority of your traffic? All of their reads/writes are going to a single partition and of course you are going to suffer from performance issues from that hot spot. It becomes important to further break down your partition into synthetic shards or break up your data by time (only keep a day of data per shard). BigTable does not innately solve this, they may deal better with a large partition but it will inevitably become a problem.

2. Some people are criticizing the choice of NoSQL citing the data size. Note you can have a small data size but have huge write traffic. An unsharded RDBMS will not scale well to this since you cannot distribute the writes across multiple nodes. Don't assume just because someone has a small data set they don't need to use NoSQL to deal with their volume

3 comments

> Every shardable database (Cassandra, Dynamo, BigTable) has to worry about hot spots.

Yeah, but the issue with DynamoDB seems to be bursts of access triggering "throughput exceptions" caused a very static bandwidth allocation which is going down with the number of shards and not so graceful handling of overload situations.

It is imho. an anti-pattern to split up the bandwidth like they do. It negates the multiplexing gain for no good reason except a rigid control model.

If one user is the majority of your traffic, you don't shard on user ID.
It depends a lot of the write. If they can be batched, you can put them in a queue or in redis until it reach a threshold and write the update down in the RDBMS. It won't work for all the use cases, but more often that people think.
If you have a persistent data problem, and introduce redis, then you'll have two persistent data problems.
Yes microbatching is a great way to get a fixed write rate regardless of traffic, can even do it right at the application layer. The trade-off is that you can theoretically lose one interval of data when your service goes down. This might not matter for analytics workloads with a margin of error but some usecases require confirmed writes