by didymospl 1911 days ago
This might be only slightly related to this post, but I started wondering: how often do you have to tweak GC in your job? Do you ask, or are you asked, questions about GC in interviews?

I've just realized that I was asked about this stuff in almost every Java interview I've had. Sometimes the questions were very detailed (even though I was nowhere near HFT or any other real-time system, and GC pauses were a minor concern), but for jobs focused on other languages this topic is almost completely skipped.

7 comments

Maybe only a little related, but in the five-ish years of ruby dev I've done, there was only one time I can remember interacting with the GC directly in production code.

It was in the context of a Sidekiq job that was importing customer data from a CSV file. We would read in the CSV, and for each row a lot of complicated logic would translate the data from the customer's format into ours and decide how to update different tables in the database. These files were sometimes 10k lines or longer (all handled by a single Sidekiq job), and memory would balloon so much that Sidekiq would crash and keep trying to restart the job. For each row we were instantiating an ActiveModel object that had a lot of attributes/functions. I think the right solution would probably have been a (fairly heavy) refactor of that area of the code, spinning up a separate job for each row, but we found that by running a GC.start every few rows we were able to clean up some of the old AM objects and keep the memory usage low for the time being...
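
For anyone curious what that workaround looks like, here is a minimal sketch, not the original job. The helper name `import_rows`, the `gc_every` parameter, and the fake rows are all made up for illustration; a real job would feed it something like `CSV.foreach(path, headers: true)` and do the ActiveModel work inside the block.

```ruby
# Hypothetical helper (not from the original job): process rows one at a
# time and force a GC run every `gc_every` rows, so that per-row objects
# which are no longer referenced get reclaimed early.
def import_rows(rows, gc_every: 500)
  processed = 0
  gc_runs = 0

  rows.each do |row|
    yield row                     # per-row translation and DB updates go here
    processed += 1

    if (processed % gc_every).zero?
      GC.start                    # explicit full collection; pauses the process
      gc_runs += 1
    end
  end

  { processed: processed, gc_runs: gc_runs }
end

# Stand-in for a large CSV file. Rows are produced lazily, so nothing here
# holds on to earlier rows once the block has finished with them.
rows = Enumerator.new do |y|
  10_000.times { |i| y << { "id" => i, "name" => "customer-#{i}" } }
end

result = import_rows(rows, gc_every: 1_000) { |row| row["name"].upcase }
```

This only helps if the per-row objects really are unreferenced after each iteration. If something keeps accumulating them, calling `GC.start` just adds pauses.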

Mirrors my experience as well. Nearly a decade of working on large Ruby/Rails apps, some with very complex reporting / data processing flows (talking like billions of db rows processed in streaming queries, media encoding, etc) and a particular CSV processing situation like yours was the only time I needed to manually trigger GC... and even that was just triggering it, not even tweaking it.

The defaults seem very good, even at scale.

Part of that is because Ruby has a very predictable, but slower, GC. Java, on the other hand, has multiple memory managers... some are optimized for high throughput/spikes, but those are much harder to predict.

That's interesting!

I've been doing Ruby since 2014. Mostly Rails, but also a bunch of data processing.

I have run into memory issues at times, when shuffling large amounts of data around. But manually running GC was never the answer in my cases.

In all cases, the memory issues were because I'd created a bunch of heavy objects that were still in-scope and were therefore not eligible to be cleaned up by GC anyway.
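
A small sketch of that point, assuming CRuby (where `ObjectSpace.each_object` is available). `HeavyRecord` and `live_records` are names invented for this example. While the array still references the objects, a forced GC cannot free them; once the reference is dropped they become eligible, although CRuby does not promise that every one of them is collected immediately.

```ruby
# Made-up class standing in for a heavy per-row object.
class HeavyRecord
  def initialize(id)
    @id = id
    @payload = "x" * 1_024
  end
end

# Force a collection, then count the HeavyRecord instances still alive.
def live_records
  GC.start
  ObjectSpace.each_object(HeavyRecord).count
end

retained = Array.new(1_000) { |i| HeavyRecord.new(i) }
while_referenced = live_records   # all 1_000 survive: `retained` points at them

retained = nil                    # drop the only reference
after_release = live_records      # usually much lower, but not guaranteed to be 0
```

So when memory keeps growing despite `GC.start`, the first thing to look for is a long-lived array, hash, or instance variable that still holds the objects.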

This was all Ruby 2.0+ and most of the heavy data processing stuff was 2.3+. So I wasn't doing any of it back in the days of really ancient Ruby GC.

I've done a lot of similar work and learned a lot of similar lessons. They were interesting and fun challenges but I've since moved on from Ruby in my professional life.

I'll say this much: When I was working on these applications, one of the minor wins that I had was swapping them over to the jemalloc memory allocator. It has introspection/instrumentation tooling that is really useful for these sorts of situations. You can use `MALLOC_CONF` [0] to trigger some built-in profiling. For instance, `export MALLOC_CONF='prof_leak:true,lg_prof_sample:0,prof_final:true'` will trigger jemalloc to log the heap at exit which is very useful for tracking down leaks.

[0]: https://github.com/jemalloc/jemalloc/wiki/Use-Case%3A-Leak-C...

I work mostly on internal Rails apps, so the need for fine-tuning GC is basically nonexistent.

I think the app that needed to be the most performant ended up using JRuby :P

I can't wait for truffleruby to be a thing.

It's anecdata, but JVM tuning comes up far more in Java-related conversations I've had than the equivalent in other languages/environments. I'm not enough of a Java expert to fully appreciate why this is.

Well, the JVM has several GCs, and they have a lot more potential for tuning. There are also many tools to gather data on what the GC is doing and to analyse heap dumps to discover the cause of problems. If you have a language like Ruby, which is normally used without a moving GC, then there isn’t huge scope for tuning things, but if you have a moving GC then there is a lot you can tinker with, such as region sizes.

It seems that Ruby's GC is conservative (first Google hit)? That pretty much means giving up on GC performance optimization.

Part of that is because there are so many options, and since Java is used a lot in software where you care about performance/throughput, you hear about it, just like you hear about all the different kinds of memory allocators you can write in C.

I ask basic questions in my interviews to see if a person is even aware of GC and the potential for object leaks in dynamic languages. We have a pretty standard webapp with a handful of backend services, not HFT, but we did have clueless interns write leaky code that wasted memory and crashed, and I don't need that to happen again. So yes, when a webapp programmer cannot even recall the term "garbage collection", that's not a great sign for me.

I'm additionally amazed at people who show up at the interview sometimes claiming C and/or C++ experience (completely not required for the role, but hey, they do claim that experience) but then seem to be completely unaware of any basics of memory management.

Ruby 1.8 and 1.9 did get significant benefit if you tuned the GC because the simple GC it used back then was tuned for quick command line startup. More recent versions have a much improved GC that doesn’t need tuning for most cases.
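
Before tuning anything, it can be worth checking how much GC activity a piece of code actually causes. A rough sketch using `GC.stat`, which modern CRuby provides; the `gc_delta` helper is invented for this example, and the workload in the block is only a placeholder.

```ruby
# Hypothetical helper: report how many GC runs and object allocations
# happened while the given block was executing.
def gc_delta
  before = GC.stat
  yield
  after = GC.stat

  {
    gc_runs: after[:count] - before[:count],
    major_gc_runs: after[:major_gc_count] - before[:major_gc_count],
    allocated_objects: after[:total_allocated_objects] - before[:total_allocated_objects]
  }
end

# Placeholder workload: allocate a batch of short-lived strings.
delta = gc_delta { 50_000.times.map { |i| "row-#{i}" } }
```

If `gc_runs` stays small relative to the work done, the defaults are probably fine and the time is better spent elsewhere.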
In modern Java you really don't need to tweak anything; the G1 defaults are typically more than enough. Maybe if you're latency-sensitive you would switch to ZGC or set the max pause time.

A Java dev will at least tweak -Xmx/-Xms at some point in their career.

I don't know of other languages/VMs that require setting an equivalent of the JVM's Xmx/Xms parameters. Why does only the JVM require it? Why not just make it unlimited by default?

Generally, you want to leave some memory available to other things, like the OS, buffers in the network stack, etc.

Having a limit for the VM is helpful. Also, by default the maximum heap is set automatically, typically to about a quarter of the available physical memory.

Personally, I've had to tune this kind of stuff with every VM I worked with (JVM, Node, PHP). :)

I'm not a Java dev at all, and I've tweaked more than just those parameters. Simply running Java programs is enough to get you introduced to the JVM's GC.