| > The graph shows that a Cassandra server instance could spend 2.5% of runtime on garbage collections instead of serving client requests. The GC overhead obviously had a big impact on our P99 latency No, this is not obvious. If you have a fully concurrent GC then spending 25 out of 1000 CPU cycles on memory management does not "obviously" have an impact on your 99th percentile latency. It would primarily impact your throughput (by 2.5%), just like any other thing consuming CPU cycles. > We defined a metric called GC stall percentage to measure the percentage of time a Cassandra server was doing stop-the-world GC (Young Gen GC) and could not serve client requests. Again, this metric doesn't tell you anything if you don't know how long each of the pauses are. If they are at the limit infinitesimally small then you are again only measuring the impact on throughput, not latency. Certainly, GCs with long STW pauses do impact latency, but then you need to measure histograms of absolute pause times, not averages of ratios relative to application time. That's just a silly metric. And neither does the article mention which JVM or GC they're using. Absent further information they might have gotten their 10x improvement relative to some especially poor choice of JVM and GC. |
It's pretty much accepted everywhere that GCs perform terribly for databases. Modern GCs are great at handling small, very short-lived memory allocations, and that's about it. Just about any other workload and manual memory management ends up being a much better use of your time than GC tuning.