Hacker News new | ask | show | jobs
by tsimionescu 693 days ago
Almost all GCs used in practice today only scan the set of live objects, which in normal operation is much smaller than the entire heap. They also allow much more efficient allocation and de-allocation.

The problems with GC are threefold, and why you might not want it in a systems language:

1. GC requires more memory than strictly necessary to be efficient (usually about 1.5x - 2x the amount you absolutely need). You're basically trading runtime efficiency for memory.

2. GC performance is harder to predict and reason about than certain other allocation strategies

3. GC languages tend to encourage excessive heap allocation for various reasons, ending up with much more junk than a typical Rust or C program that has a similar amount of entities

Note that item 2 is the one that's least understood. The best part about GCs is that they make heap allocation trivial, and they make de-allocation a no-op. In contrast, both malloc() and free() are extremely complex and costly operations. The GC does impose a cost on every pointer write, similar to (but typically less than) the overhead of Arc<T> over a T*, but that has a very uniform and predictable cost. The problem of unpredictability only comes in the collection phase, and is mostly related to (a) when the collection happens, (b) how much data actually has to be scanned (how many live objects are present on the heap and stack), and (c) what type of collection needs to happen (is it enough to collect from this thread's young generation, or do you need to collect all generations from all threads).

Note that many of these problems are in fact solvable, and there actually exist GCs with constant predictable collection times, suitable even for realtime applications (which malloc/free don't support). They are very sophisticated technology that no one is distributing for free though (e.g. you have to pay Azul for a realtime-compatible Java).

4 comments

People often forget that the runtime of free is not trivially calculatable!

I've worked on tiny embedded systems (.net micro framework) where for a given usage pattern the GC was perfectly predictable, as it should be.

What do you mean?

Assuming you use a set of slabs of fixed size objects and keep free objects in a linked list, both malloc and free are trivial O(1) operations.

Destructors with cascading deletions can take time bounded only by memory allocation, but you can solve that for instance by destroying them on a separate thread, or having a linked list of objects to be destroyed and destroying a constant number/memory size of them on each allocation.

You still need locking to make this work in a multithreaded environment, or at least atomics. And all malloc implementations used today are more complex than this, especially when allocating large objects, because you can't actually maintain a list for every possible size of allocation. That means they need to do extra work to handle fragmentation.

Plus, the free list has to occasionally be walked to actually free pages back to the OS. If you don't, then memory is never freed by free(), it is only marked for potential reuse.

There are several popular implementations of malloc (and their corresponding free), and they are all quite complex and have different tradeoffs. And, in fact, none of them is any more suitable for high performance or realtime code than a GC is. The golden rule for writing this type of code is to just never call malloc/free in critical sections.

And I will note that probably the most commonly used malloc, the one in GNU's glibc, actually uses Linux locks, not atomics. Which means that you are virtually guaranteed to deadlock if you try to use malloc() after fork() but before exec() in a multi-threaded process.

> Destructors with cascading deletions can take time bounded only by memory allocation, but you can solve that for instance by destroying them on a separate thread, or having a linked list of objects to be destroyed and destroying a constant number/memory size of them on each allocation.

Which means calculating how long free going to take for a given object is incredibly difficult. Which is my entire point. Of course it's not non-deterministic unless someone 's destructor call s some random code but it is not easy to calculate and in the real world is basically unpredictable.

Spitting up a separate thread and doing an unknown amount of work is no more predictable than doing on the main thread.

Fundamentally destructors can do whatever the hell they want which means you never know how long destruction is going to take.

And of course if you're freeing up an object that has a linked list inside of it and you need to clear out the link list, now you're jumping all around memory and that's never fun, and the run time sure as heck is not constant based on what needs to be paged in and out and what exists in cache lines where.

I'm not saying these problems are unsolvable for a given use case and there are good reasons why custom allocators abound throughout the industry, My main point is that manual memory management does not necessarily lead easy to predict run times for freeing up memory.

This especially true because a lot of developers think that malloc and free are just magic somehow.

In reality, garbage collectors often aren't that complicated, and when it comes to large complicated user applications, you're going to be spending a lot of time in the allocator and deallocator no matter what memory management schema you choose to use.

(Edit: and then there's fragmentation which is the death of many manually managed memory systems! A naively Java or c-sharp application can run for far longer than a naively written C++ application!)

But having such a naive malloc implementation would put it well below what a state-of-the-art tracing GC can do in allocation speed, and then we didn’t even get to fragmentation, elegantly solved by moving GCs. Of course, all these are tradeoffs, I’m just saying that it isn’t as easy.
> Destructors with cascading deletions can take time bounded only by memory allocation, but you can solve that for instance by destroying them on a separate thread, or having a linked list of objects to be destroyed and destroying a constant number/memory size of them on each allocation.

You can run your garbage collector on a separate thread, too. Or use a real time garbage collector.

'Costly' doesn't seem to require that O(k) be non-constant:

http://www.gii.upv.es/tlsf/

> 2. GC performance is harder to predict and reason about than certain other allocation strategies

I'm not sure what you mean by this being the least understood. It seems like it's very well understood: GC introduces latency if you want good throughput, or it reduces throughput if you want excellent latency (sub-100us is possible).

Of course, that doesn't mean you can predict what specific throughput or latency properties your specific program will have, except for GCs that have maximum bounded latency guarantees.

The point is that it's very hard to predict how much of an impact the GC will have a priori. You can of course measure after the fact, and try to improve, but it's hard to architect your system in a way that you can more or less guarantee will get good GC performance (other than just not allocating any memory, of course). Malloc actually suffers from a similar problem: it's just hard to know a priori if your access patterns will work well with the internals of your allocator/GC.
IBM Metronome and similar approaches provide very strong guarantees (if you keep a certain bound on garbage creation ofc) at the expense of throughput.

What they lose is that they introduce also a certain guaranteed loss of available cpu time, due to optimizing for latency and real-time predictability.

Respecting generational hypothesis is a good step towards engineering a GC-friendly design. Mind you, this brings back the necessity to reason about object lifetime, something a GC-based language absolves you from, but it does help. It tends to teach you a lesson that sometimes higher allocation count of objects that die in Gen 0 is preferable to fewer larger allocations which have higher chance of surviving until Gen 1 (.NET has three main generations for GC objects - 0 and 1 aka ephemeral and 2 aka long-lived, there are also LOH, POH and NonGC-heap).
100us counts as excellent latency? x) that's half a million CPU cycles!
That's fast enough for real-time audio, so yes, that's excellent. And that includes compaction and only a small hit to throughput. You can get sub-2us latency at a serious hit to throughput.
> You're basically trading runtime efficiency for memory.

What do you mean? Aren't GCs both less efficient at run time and use more memory?

No, the more memory you give them, the more efficient they are, and generally they are much more efficient than an equivalent program using malloc() + free() for every piece of memory that gets allocated.

Take the extreme case of a program where every allocation is permanent. The GC allocation will be much much faster than malloc() (GC allocation is normally just bumping a pointer to the end of the heap, while malloc() typically does a lot more work to segregate objects by size etc), and collection will never run. So, overall, the program will be significantly faster.

Edit: More realistically, a program that typically produces 1:1 live mmeory:junk, but occasionally spikes, will consume 0 GC time in normal operation (the GC will just let the junk sit there), while free() needs to run on all that junk as soon as it was created, otherwise it leaks forever.

Also, the fact that GCs can defer collection to later than the object going out of scope can often make them more efficient than a traditional free(). For example, if you have a large graph data structure that is now junk, free() needs to do work that's O(size of graph) to free all the nodes. A copying collector, the most popular design, never even looks at the graph, so it frees it in O(1). Edit: of course, the converse is also true: if the graph survives N collections, the copying GC has to do O(N * size) work, where malloc()/free() do nothing at all.

This is in fact heavily related to "ancient" Unix approach, and partially why it optimized so hard for many small processes.

PDP-11 unix (without split instruction/data space) had 32kB of memory for userland process, total, plus another 24kB for kernel (8kB are reserved for I/O devices).

The memory management interface that gave us malloc() and free() ultimately boiled down to brk() and sbrk() which simply set a pointer describing where stack and heap met in memory. Using many small programs, you'd have as little as possible of that 32kB used by code, and prefer to simply allocate using bump pointer and never free (better reuse object pools, which is also a technique used with GCs). A program in a pipeline would mostly allocate some stuff at the start plus buffers, and while some of the early bits might become garbage it doesn't really matter because you can't share parts of your 32kB block (or at best you can use 8kB segments). So you do "missile-style GC" and just use a bump-pointer allocator (brk() + sbrk() + sizeof combo), then exit to OS when your work is done and the entire memory space gets cleaned.

EDIT: Some GCs (like in later versions of Symbolics Lisp Machines, or presently in SBCL behind some experimental options) provided multiple arena support explicitly to be able to force-declare everything allocated in an area as garbage.

GC does force you into a memory layout for your objects that negates any advantage in the allocation routines proper.
I'm guessing you're referring to the fact that the most popular GC languages allocate almost every object on the heap, even when used as a field of another object. This is by no means a constraint of the GC, it is a separate design choice.

And in fact C# has always supported "value objects", which do get allocated in-place, Java is adding the same support probably in the next release, and Go actually defaults to it.

No it doesn't. GCs need to be able to traverse objects. They don't impose object layout constraints.
Only if speaking about languages without system programing capabilities alongside their GC features.
Worst case, 2× is actually a steal compared to a non-moving allocator such as malloc/free or RC built on top of that, which cannot[1] do better than about 1 + ½log₂(largest allocation / smallest allocation). For example, if you have allocations from 1K to 16K bytes, any malloc/free implementation can require at least 3× the memory you actually use; if from 16 to 16K, 6×; etc.

At this point I must mention that paged memory can get you just enough movability that catastrophic fragmentation of this kind is avoided[2] with high probability. Paging tricks can be seriously expensive on a modern multicore processor, though, so I’m not sure this is the way forward. (The paper report only 1% perf overhead for Firefox; I don’t know if that means Firefox is less multithreaded than it could be or what.)

[1] https://www.sqlite.org/malloc.html

[2] https://github.com/plasma-umass/mesh

For anyone following the discussion: Mesh (https://github.com/plasma-umass/mesh) is a seriously interesting read.
It depends what you mean by “efficient”: a world that controls when someone can free memory can be surprisingly efficient at dealing with the fundamental fragmentation problem in terms of overall churn over time (throughout) at the cost of space and immediate time (latency). Both are different forms of efficiency.

Manual memory management as a solution to the fragmentation problem trades that off, not knowing anything about when free might be called, and so has to lean towards optimising space and immediate time rather than throughout. But there’s still a memory manager behind the scenes that has to deal with fragmentation as well; there’s no get out of jail free card for that, and that complexity is still hidden.

(Helpful memory usage disciplines like arenas/pools have their desirable properties for the same reasons: it’s a discipline on when you free memory in order to avoid fragmentation.)

They are massively faster (as in, has better throughput) than ref counting. See my previous comment for more details.
Faster than ref counting AND than manual memory management. Papers that have recorded the traces of memory allocation and liveness info and then replayed them using GC or by inserting manual new/free show GC has a throughput advantage here.
Side question: does Azul still sell custom silicon/designs to accelerate JVM?