Hacker News new | ask | show | jobs
by the8472 3025 days ago
> The graph shows that a Cassandra server instance could spend 2.5% of runtime on garbage collections instead of serving client requests. The GC overhead obviously had a big impact on our P99 latency

No, this is not obvious. If you have a fully concurrent GC then spending 25 out of 1000 CPU cycles on memory management does not "obviously" have an impact on your 99th percentile latency. It would primarily impact your throughput (by 2.5%), just like any other thing consuming CPU cycles.

> We defined a metric called GC stall percentage to measure the percentage of time a Cassandra server was doing stop-the-world GC (Young Gen GC) and could not serve client requests.

Again, this metric doesn't tell you anything if you don't know how long each of the pauses are. If they are at the limit infinitesimally small then you are again only measuring the impact on throughput, not latency.

Certainly, GCs with long STW pauses do impact latency, but then you need to measure histograms of absolute pause times, not averages of ratios relative to application time. That's just a silly metric.

And neither does the article mention which JVM or GC they're using. Absent further information they might have gotten their 10x improvement relative to some especially poor choice of JVM and GC.

4 comments

you clearly didn't read the post very closely. They said 2.5% of CPU cycles were spent on stop-the-world young generation collections, not on the sum total of all memory mangement. That means that 2.5% of the time the app is entirely stalled on just these collections. Given that stop-the-world pauses are never evenly distributed throughout time, it should be very much expected that this much GC stalling would affect p99 latencies.

It's pretty much accepted everywhere that GCs perform terribly for databases. Modern GCs are great at handling small, very short-lived memory allocations, and that's about it. Just about any other workload and manual memory management ends up being a much better use of your time than GC tuning.

> Given that stop-the-world pauses are never evenly distributed throughout time

That is not a given. And, even distribution is only part of the equation. If they are sufficiently short, then even being somewhat unevenly distributed should not have much of an impact on latency. For example, if the max length of a pause were 1ms, and 99p latency were 15ms, you'd have to be fairly unlucky to see a 33% increase in latency99 due to GC. That would entail 5 of 25 pauses happening during a 20ms period in a 1s window.

(This idea is not purely hypothetical. For example, Go's GC has very low STW periods.)

> It's pretty much accepted everywhere

Eh. Apparently everyone thinks C is the best language for cryptography and other secure but not particularly perf sensitive code. Go figure. Sometimes the wisdom of the masses is not wisdom. Best not to appeal to it during argumentation.

You're right on your first point. I responded a little harshly to OP because I believe they responded too harshly to the blog post. The P99 GC latency data they give is us not sufficient to explain their perf gains. More likely, the metric they were tracking was well-correlated with their perf gains, but not the only or even primary cause. The other perf gains could have been caused by reduction in older generation collection times, for instance.
I'd love to see how Go's GC performs when running an application similar to Cassandra, on multiple cores, with gigabytes of memory allocated.
Go's GC is non-moving, incremental and single-generation. Hotspot GCs are all moving collectors so they make different tradeoffs. IBM's metronome collector would be the closest one to Go's properties.
This is a thing others have already done, although I don't have citations to hand.
So why do people keep building latency sensitive things in the JVM? And then they manage to get hugely popular?

Cassandra is a constant struggle with the GC. I’d guess the cost of running it is at least an order of magnitude greater compared to if it had been implemented in c++ or something more sensible.

To be fair, a lot of big companies know how to tune the JVM. A TON of HUGE companies write a LOT of java. What you consider a constant struggle, a lot of very large companies consider trivial.
I'm not sure it's trivial. Tuning the JVM is an entire cottage industry. JVM performance experts can make 1000+/day tuning the JVM and are in high demand. Companies spend huge amounts of engineering effort to keep the JVM running smoothly. I used to be involved in this side of things pretty heavily at a HFT firm, which almost exclusively used Java.

In my opinion it's a colossal waste of resources. Classic example of using the wrong tool for the job.

And it's still cheaper to hire a guy with a skill like that for a few weeks, or even keep him permanently - and keep a larger development team of cheaper C# or Java devs, than it is to replace them all with higher-payed C/C++ devs which would probably take longer to get the same functionality up & running.
Good Q. You might like to check ScyllaDB written in C++, which is supposed to have considerably better performance than Cassandra (also low tail-latency) and a level of compatibility with it: https://www.scylladb.com/
Many of these open-source databases started as internal projects inside big companies, where Java/JVM allowed for more productivity and cross-platform deployment with more skill reuse of the team. Then they grew from there and now it's too late to rewrite the whole thing.

If you were starting a database-focused company from the beginning than choosing C++ is a better decision, which is exactly what ScyllaDB has done with their cassandra clone. Along with general algorithm and decision improvements, it'll provide 10-100x the same performance at lower latency on the same server.

We're also starting to see more projects written in Go now, which is still a managed runtime but usually better at handling these kinds of low-level systems.

Go is a much better choice for systems work. Largely because the GC has a low pause (sub ms) target. I'd still be hesitant to use it for very latency sensitive things, or memory intensive applications. Prometheus, for example, has struggled with golang's memory management (bad memory fragmentation, wasteful memory usage). But I think it's a great compromise if you don't want to deal with memory management.
Sure, I would also .NET/C# to the list now.

.NET Core on Linux is very fast and there are some great developments around fast low-level (yet managed) managed memory manipulation that can lead to some very fast software.

OpenJDK/Hotspot is not the only game in town. Those who really need low latencies can opt to use other JVMs (some commercial) with GCs that provide very low pause times, usually at the expense of some percentage points of throughput. In large corporate environments that might not be a problem.
Apparently these people enjoy GC/JVM languages more than C++.
GC languages like Java is much easier to write, and can be made performant when required.
Yes, this is the reason why Java is chosen. But I feel pretty strongly that databases are system engineering problems, and should be written using a proper systems language.

Something like Java makes implementation easier, but operation more difficult and costly.

classic hacker news comment.

this thing you built and open sourced, has gotten you real measurable results? allow me to list the many ways you’re probably wrong and doing it incorrectly

Measurable results are all well and good, but it can be helpful to know how the baseline was established. Measurable results aren't "portable" without a well-established baseline.
Both code and benchmark are open sourced. We'd love to hear how it performs for you.
This is a valid criticism of the methodology / explanation. It's not about the results. You can agree with the positive results (and they're great! - you've done awesome work and clearly show an improvement) and still say the explanation how/why they were achieved is not great.
Exactly. Immediately from the premise of the paper, I was looking forward to a discussion on how they tried various strategies to tune their JVM/GC parameters and found nothing. The "well I guess we gotta replace this with a C++ solution" sentiment smacks of poor software engineering practice, despite the results.
That you didn't find it, doesn't mean they didn't do any. Maybe it wasn't worth mentioning. Maybe they wanted to keep the post short. Some team successfully did a storage engine transplant and you're saying they're doing poor engineering?

That's definitely not what I was suggesting in my comment.

> The "well I guess we gotta replace this with a C++ solution" sentiment smacks of poor software engineering practice, despite the results.

On the flip side, recognizing when you are not efficiently using your time to solve a problem with the current approach and deciding to switch to another one is what I would consider a vital part of good software engineering, and also one of the hardest things to become good at without a lot of experience.

I'm not sure that's the case here, but I don't doubt it's possible they could easily have wasted more time testing and tweaking Java than it took to implement this solution. Whether that's how it would have (or possible even did to a degree) played out is an open question.

Good memory management is important for a high performance, low latency DBMS. I don't see how getting rid of GC is bad engineering practice. If a tool is not well suited for a problem, use another tool. Could you please explain your opinion?

//Edit: removed 2nd part, which wasn't really that important anyway...

That might be an argument for not using Cassandra. It’s a pretty big leap to reimpmenting half of Cassandra in c++.
> If you have a fully concurrent GC then spending 25 out of 1000 CPU cycles on memory management does not "obviously" have an impact on your 99th percentile latency.

I try to understand the meaning. Is it saying the latency caused be GC is applied to all requests, not just the ones that observe 99th percentile latency?

It's saying that whether it affects latency or just throughput depends on how those pauses are distributed in absolute terms, not just the ratio. There's a big difference in 99th percentile latency between a 1ms pause every 400ms and a 10 second pause every 67 minutes, but they both work out to 2.5% by the ratio metric.

So yes, at the `infinitesimally small` end, time would be 'stolen' evenly from all request threads and would not be a contributing factor to the 99th percentile.

No, that would be an incremental GC working in very small time slices.

A concurrent GC spends CPU cycles on different cores to do its work, which means it will not cause latency outliers in the threads processing the requests. They are still CPU cycles you don't have to serve other requests, hence they still affect throughput.

That is a simplified explanation of course, there are a lot of caveats.

In my original post I was mostly speaking about the measurement though, since they are measuring throughput when they are concerned about latency, those are somewhat related but depending on circumstances only weakly so.

What about the requests that are processed by a core that's doing a GC? Wouldn't that cause a higher P99 latency exactly as you'd expect?

Spending 2.5% of cycles on GC doesn't mean those cycles are perfectly distributed. The distribution of GC work onto cores is bound to have some cores doing more work than others, which would (I would think) manifest itself as some requests that land on an unlucky core getting more latency. Isn't it totally expected that this would cause a P99 latency spike?

Maybe if you operated your system at the saturation point, which you really don't want to do in practice. Instead you want your queues to be mostly empty. Bursts are inevitable but bursts coinciding with GC doing work hopefully is a beyond 99th percentile thing. Of course if we're speaking purely theoretical we could also assume spherical cows in a vacuum and say that requests don't burst and simply arrive in a metronome-like trickle and then the spikes evaporate too. This is basically queuing theory.

And you could also use more CPU cores than request workers, that way you will always have spare core capacity and thus your latency will not be directly impacted. That is if you really really value latency more than throughput.

Again, my main point is that throughput and latency are not the same thing. There is some relation in so far that you cannot fulfill latency promises if your throughput is insufficient and your queues start filling up. But below the saturation point it's a lot more complicated, especially in parallel systems with bursty arrivals.

> A concurrent GC spends CPU cycles on different cores to do its work, which means it will not cause latency outliers in the threads processing the requests.

That makes sense, how about GC for single-threaded languages, e.g. Nodejs?

Just because a language is single threaded in the code you write doesn't mean it doesn't use threads behind the scenes, like for GC or other things.