by rbranson 4851 days ago

First, let me concede that Cassandra has had a storied history of terrible read performance. However, if the last time you looked at Cassandra's read performance was on 0.8, or with size-tiered compaction, I'd encourage you to take another look.

The p95 latency issues were largely caused by GC pressure from having a large amount of relatively static data on-heap. In 1.2, the two largest of these, bloom filters and compression data, were moved off-heap. It's my experience that with 1.2, most of the p95 latency is now caused by network and/or disk latency, as it should be.

I'm not going to compare it with other data stores in this comment, but I'd encourage people to consider that Cassandra is designed for durable persistence and larger-than-RAM datasets.

As for #4, this is mostly false. Tombstones (markers for deleted rows/columns) CAN cause issues with read performance, but "issues while GC'ing large number of tombstones" is a bit of a hand-wavy statement. The situation in which poor performance would result from tombstone pile-up is if you have rows where columns are constantly inserted and then removed before GC grace (10 days). Tombstones sit around until GC grace has elapsed, so effectively consider data you insert to live for at least 10 days, unless of course you do something about it.
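To make the GC grace point concrete, here's a rough Python sketch of when a tombstone becomes eligible to be dropped. It's a toy model, not Cassandra's code: the Tombstone class and the function names are made up for illustration, and real compaction has additional conditions (for example, the tombstone must not still be shadowing data in SSTables outside the compaction).

```python
from dataclasses import dataclass

# Default GC grace discussed above: 10 days, in seconds.
GC_GRACE_SECONDS = 10 * 24 * 60 * 60
DAY = 24 * 60 * 60


@dataclass
class Tombstone:
    """Hypothetical stand-in for a column-level delete marker."""
    column: str
    deleted_at: int  # wall-clock time of the delete, in seconds


def purgeable(tombstone, now, gc_grace=GC_GRACE_SECONDS):
    # Simplified rule: a tombstone may be dropped only once it is
    # older than GC grace.
    return tombstone.deleted_at < now - gc_grace


def lingering(tombstones, now, gc_grace=GC_GRACE_SECONDS):
    # Tombstones that reads still have to step over.
    return [t for t in tombstones if not purgeable(t, now, gc_grace)]


# Columns inserted and deleted on day 0, day 5 and day 9, observed on day 11.
tombstones = [
    Tombstone("a", 0 * DAY),
    Tombstone("b", 5 * DAY),
    Tombstone("c", 9 * DAY),
]
still_there = lingering(tombstones, now=11 * DAY)
```

With the 10 day default only the day-0 tombstone is old enough to go away on day 11. Drop the grace period to 1 day in this model and all three become purgeable, which is the effect of tuning GC grace down.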

Usually people just tune the GC grace, as the default is extremely conservative. It's also much better to use row-level deletes if possible. If the data is time-ordered and needs to be trimmed, a row-level delete with the timestamp of the trim point can improve performance dramatically. This is because a row-level tombstone will cause reads to skip any SSTables with max_timestamp < the tombstone's timestamp. It also means compaction will quickly obsolete any superseded row-level tombstones.
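Here's a small Python sketch of the skip described above: a read that knows about a row-level tombstone can ignore whole SSTables whose newest write is older than it. This is a toy model under my own assumptions, not how the read path is actually coded. The SSTable class, read_row, and the dict layout are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SSTable:
    """Hypothetical, heavily simplified SSTable holding one row."""
    name: str
    max_timestamp: int  # newest write timestamp contained in this table
    columns: dict = field(default_factory=dict)  # column -> (timestamp, value)


def read_row(sstables, row_tombstone_ts):
    """Merge one row across SSTables, honoring a row-level tombstone.

    Returns (live columns, names of the SSTables actually scanned).
    """
    merged = {}
    scanned = []
    for table in sstables:
        # Everything in this table predates the row delete: skip it entirely.
        if table.max_timestamp < row_tombstone_ts:
            continue
        scanned.append(table.name)
        for column, (ts, value) in table.columns.items():
            # Writes at or before the tombstone are shadowed by it.
            if ts <= row_tombstone_ts:
                continue
            # Newest timestamp wins when a column appears more than once.
            if column not in merged or ts > merged[column][0]:
                merged[column] = (ts, value)
    return {c: v for c, (ts, v) in merged.items()}, scanned


sstables = [
    SSTable("old", 100, {"a": (50, "x")}),
    SSTable("mid", 200, {"b": (150, "y"), "c": (200, "z")}),
    SSTable("new", 300, {"d": (250, "w")}),
]
# Trim the row at timestamp 180 with a single row-level delete.
live, scanned = read_row(sstables, row_tombstone_ts=180)
```

In this model the "old" table is never touched, and column "b" in "mid" is dropped because it was written before the trim point. With per-column deletes instead, the read would have to visit every table and step over each tombstone individually.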

Here's a graph of p99 latency as observed from the application for wide row reads (involving ~60 columns on average, CL.ONE) from a real 12-node hi1.4xlarge Cassandra 1.2.3 cluster running across 3 EC2 availability zones. The p99 RTT between these hosts is ~2ms.

http://i.imgur.com/WRdps3B.png

This also happens to be on data that is "ephemeral" as our goal is to keep it bounded at ~100 columns. The read:write ratio is about even. It has a mix of row and column-level deletes, LeveledCompactionStrategy, and the standard 10 day GC grace.