Hacker News new | ask | show | jobs
by jerf 4761 days ago
I think one of the keys to understanding garbage collection is to understand that it is on a continuum of memory management techniques, and the line is a great deal less bright and shining than people often realize. malloc/free is "manual memory management", right?

Well, not really. Truly manual memory management is getting an enormous block of memory from the OS, and fully manually choosing what goes where within that block of memory. This is indeed a real technique used when the situation is dire enough. If you're not doing that, you're deferring something to your automated solution, the only question is, how much?

malloc/free is a combination that still gives you lots of power, but it's not perfect when you start pushing the abstraction hard enough. All malloc/free combinations have some sort of pathological case, where whatever allocation strategy they are choosing is wrong for some reason. That's why there isn't just one implementation, there's many, and some applications can see big wins switching, while others may see losses.

Garbage collection isn't really some sort of binary dichotomous decision vs. malloc/free; both are forms of automated memory management. Garbage collection techniques just step it up, and try to infer when you are done with memory. The disadvantage is, they may not be as smart as a human, the advantage is that they're a heck of a lot more consistent and pay much better attention. Then, even within "garbage collection", you've got a gradient; you may see a language like Rust with mostly deterministic memory management, but with an easy call-out to GC if you need it. You may see a language like Perl, with relatively consistent finalization as soon as an unshared reference leaves a scope, or you may see something that only collects during sweeps. At the far end you get imprecise garbage collectors, such as those used in Go right now (though they are working on it, and have already made a lot of progress), so even within the realm of GC there's a range of precision vs. speed vs. nuances the programmer needs to know to use them.

GC is necessary because one particular point on this large spectrum isn't the right choice for every program. It isn't even the maximally manual solution. In fact, there's even some middle grounds between malloc/free and fully manual memory management, such as arena allocation. There's a lot of fine gradation on this scale.

"(Also, databases and operating systems can live without whatever makes garbage collection necessary, since they don't use it, and those are pretty complex and fast pieces of software.)"

A bold statement. Have you ever heard the term "compaction" used within the context of a database? That's a form of garbage collection. Contrary to what you said, almost all databases have some form of garbage collection. (Probably all the ones you're thinking about.)

As for whether operating systems have garbage collection, it depends on your precise definition of "operating system". Kernels may lack it, but as critical as they are, and as much sheer staggering code they may have for drivers and stuff, conceptually they actually aren't that complicated, compared to a lot of userland things. Broaden your definition and you'll find some form of garbage collection appear again. And if you include "reference counting" approaches as a form of garbage collection, the Linux kernel uses that all over the place. Is reference counting a form of garbage collection? Well, it can actually be a tough call... because it's all on a continuum.

3 comments

(thank you for the willingness to write such a detailed response)

There exists a point on the memory management continuum where management starts being more automatic (gc) than manual (malloc/free). I would like to understand the forces surrounding this specific point, right before the scale tips towards automatic.

If you tried to build a dynamic language without automatic management, what would break and why?

The biggest problem is scope control; as you start having closures that get passed around freely, those closures drag values along with them that you can't collect. It is not impossible to write this with malloc/free, but I've played that game and it's not very fun. And remember, what seems easy in one little blog post isn't easy in a real program where you've got dozens of the little buggers flying every which way. (And by dozens, I mean dozens of distinct different types of closures from different sources, not just dozens of instance of the same code instance.)

Many of the dynamic languages fling closures around with wild abandon, often without you even realizing it. (One object has a reference to another object which is from the standard library which happens to have a closure to something else which ends up with a reference to your original object... oops, circular reference loop. Surprisingly easy.)

There isn't much technically impossible with malloc/free (though IIRC there are indeed some cases that are both sensible and actually can't be correctly expressed in that framework, but the example escapes me), but there's lots of practical code where the cost/benefit ration goes absurdly out of whack if you're trying to write the manual freeing properly. It's hard to write an example here, because it arises from interactions in a large code base exceeding your ability to understand them. It's like when people try to demonstrate how dangerous old-style pthreading is; even though the problem is exponentially bad in nature, anything that fits in a blog post is still comprehensible. The explosion doesn't happen until you got to real code.

I can see how closures would take some work. Thanks.

There's evidence garbage collection was not the desired solution but a plan B. McCarthy writes the reason reference counting was not implemented was a hardware limitation [1]:

"Since there were only six bits left in a word, and these were in separated parts of the word, reference counts seemed infeasible without a drastic change in the way list structures were represented. (A list handling scheme using reference counts was later used by Collins (1960) on a 48 bit CDC computer).

The second alternative is garbage collection..."

[1] - http://www-formal.stanford.edu/jmc/history/lisp/node3.html#S...

If the language was also dynamically typed, that could also lead to some interesting problems. What is the actual size used to represent a string, or an in, or a double in the language? What happens when you want to compare/convert different types?

Either the language provides these capabilities, in which case it's doing some memory management of it's own, or you write them, in which case it wasn't necessarily all that dynamic to start with.

I imagine it might look like what you get when you try to write a dynamic language in C (Perl, Ruby, Python), a lot of macros that automate what's going on to the degree you are mostly writing some macro based DSL.

"[T]he advantage is that they're a heck of a lot more consistent and pay much better attention."

This implies that the only reason for garbage collection is programmer fallibility. However, I think there are some cases where there is literally no better way of knowing when everyone is done with a piece of memory than by checking (closures strike me as an area this can happen pretty quick), at which point the only solution is automated garbage collection - be it provided by the runtime, a library, or written by the programmer for the particular program.

> This is indeed a real technique used when the situation is dire enough.

Like console video games. Rarely there is malloc/free type of allocations, and most of them are for some third-party API, and often there are ways to avoid it... or the system itself (sometimes C CRT would do that beneath you, sometimes obvious - strdup, sometimes almost - fopen/fclose, and sometimes unexpected - for example certain *printf functions might do it).