Hacker News new | ask | show | jobs
by derefr 2313 days ago
Erlang has a parameter called initial_heap_size. Each new actor-process in Erlang gets its own isolated heap, for which it does its own garbage-collection on its own execution thread. This initial_heap_size parameter determines how large each newly-spawned actor’s heap will be.

Why would you tune it? Because, if you set it high enough, then for all your short-lived actors, memory allocation will become a no-op (= bump allocation), and the actor will never experience enough memory-pressure to trigger a garbage-collection pass, before the actor exits and the entire process heap can be deallocated as a block. The actor will just “leak” memory onto its heap, and then exit, never having had to spend time accounting for it.

This is also done in many video games, where there is a per-frame temporaries heap that has its free pointer reset at the start of each frame. Rather than individually garbage-collecting these values, they can all just be invalidated at once at the end of the frame.

The usual name for such “heaps you pre-allocate to a capacity you’ve tuned to ensure you will never run out of, and then deallocate as a whole later on” is a memory arena. See https://en.wikipedia.org/wiki/Region-based_memory_management for more examples of memory arenas.

4 comments

The games and GPU apps I’ve worked on use memory pools for small allocations, where there will be individual pools for all, say, 1-16 byte allocations, 16-64 byte allocations, 64-256 byte allocations, etc. (Sizes just for illustration, not necessarily realistic). The pool sizes always get tuned over time to match the approximate high water mark of the application.

I think pools and arenas mean pretty much the same thing. https://en.wikipedia.org/wiki/Memory_pool I’ve mostly heard this discussed in terms of pools, but I wonder if there’s a subtle difference, or if there’s a historical reason arena is popular in some circles and pool in others...?

I haven’t personally see a per-frame heap while working in console games, even though games I’ve worked on probably had one, or something like it. Techniques that I did see and are super-common are fixed maximum-size allocations: just pre-allocate all the memory you’ll ever need for some feature and never let it go; stack allocations sometimes with alloca(); and helper functions/classes that put something on the stack for the lifetime of a particular scope.

I’ve always understood an arena to use a bump pointer for allocation and to support only a global deallocation, as the GP describes.

A pool or slab allocator separates allocations into one of a range of fixed-size chunks to avoid fragmentation. Such allocators do support object-by-object deallocation.

I've seen Jason Gregory talk about per frame arenas in Game Engine Architecture as a fundamental piece of how the Naughty Dog engines tend to work.

Totally agreed that they aren't required for shipping great console games (and they're really hard to use effectively in C++ since you're pretty much guaranteed to have hanging references if you don't have ascetic levels of discipline). This is mainly just meant as a "here's an example of how they can be used and are by at least one shop".

Seems like this could be handled with a wrapper type with runtime checks during debug?

Like make any pointer to the per frame allocation be a TempPointer or something and then assert they're all gone with a static count variable of them? Then you just have to be cautious whenever you pass a reference to one or convert to a raw pointer.

I don't think this would be too awful for performance in debug builds.

Yeah, or a generation system where the pointer holds a frame count too that's asserted on deref.

The point though is that it's a step back still from shared_ptr/unique_ptr by becoming a runtime check instead of compile time.

So I kind of disagree with the idea that arenas are all about deallocation at once. There's other contexts where you have separate arenas but don't plan on deallocating in blocks, mainly around when you have memory with different underlying semantics. "This block of memory is faster, but not coherent and needs to be manually flushed for DMA", "this block of memory is fastest but just not DMA capable at all", "there's only 2k of this memory, but it's basically cache", "this memory is large, fairly slow, but can do whatever", "this block of memory is non volatile, but memory mapped", etc.

I'd say that arenas are kind of a superset of both what you and I are talking about.

I can’t remember the last time I read C code, but I do recall a particular time when I was reading a library that had been written with a great deal of attention to reliability. The first thing it did was allocate enough memory for the shutdown operations. That way on a malloc() failure, it could still do a a completely orderly shutdown. Or never start in the first place.

From that standpoint, you could also categorize arenas on a priority basis. This one is for recovery operations, this one for normal operation, and whatever is left for low priority tasks.

> The first thing it did was allocate enough memory for the shutdown operations.

That is clever and beautiful. Have to look for chances to do similar to see if I can establish a new habit myself.

That strategy is more important on systems that don’t do demand paged virtual memory. In Think Class Library on classic Mac OS, it was called the “Rainy day fund”.

One can also do that in stages:

- allocate a large block at startup

- when running out of memory, reallocate it at a smaller size and warn the user

- when running out of memory again, free the block and attempt an orderly shutdown.

Those aren't arenas. I'm inclined to agree with Wikipedia's definition, which does emphasise deallocation all at once:

> A region, also called a zone, arena, area, or memory context, is a collection of allocated objects that can be efficiently deallocated all at once.

I mean, wiki uses zone and region here as synonyms, so according to wiki that definition applies just as much. And yet:

https://www.kernel.org/doc/gorman/html/understand/understand...

Like, as a embedded developer, these concepts are used pretty much every day. And in a good chunk of those, deallocation isn't allowed at all, so you can't say that the definition is around deallocation at once.

You can also see how glibc's malloc internally creates arenas, but that's not to deallocate at once, but instead to manage different locking semantics. https://sourceware.org/glibc/wiki/MallocInternals

can be efficiently deallocated all at once.

There are implementations of arenas that, basically, act as separate memory managers. You can allocate and free memory inside them at will, but can also deallocate the whole thing in one sweep. The latter can be a lot faster, but of course, it requires you to know you can throw away all that memory in one go (handling web requests is the exemplar use case)

Note that most general-purpose allocators also keep around internal arenas from which they hand out memory.
Not sure how this is related. A general purpose allocator with a plain malloc interface can’t use this to do anything useful wrt lifetime because there is no correlation to lifetime provided by the interface. Internal arenas can be useful to address contention and fragmentation.
I'm pointing out that an arena is more about "a region of memory that you can split up to use later" than "a region of memory that must be allocated and deallocated all at once".
I've never heard that use of the term "arena". Are you thinking of slabs? Arenas are typically allocated and deallocated at once. That's their main feature.
Seems like a bit of an ideosyncratic use of the word. In tcmalloc these per thread zones are just called the thread cache (hence the name "thread caching malloc").
> memory allocation will become a no-op (= bump allocation)

No, that's a cache miss.

No, as memory is allocated linearly the cpu prefetchers will most likely keep the heap in cache.
No, in practice this is not true. You will also need to write out to memory all the newly allocated objects which you don't need anymore.

https://podcasts.apple.com/us/podcast/hidden-gc-bandwidth-co...

How is that related to memory allocation costs? We are talking about the cost of obtaining a chunk of memory. The cost of actually creating an object is allowed to be much higher because constructors are allowed to execute any arbitrary code.

Just think about how expensive it would be to allocate a 3d vector consisting of 3 floats with malloc() 20000 times and then later deallocate it. Nobody is worrying about the cost of writing 3 floats to RAM. Everyone is worrying about the cost of malloc traversing a free list and it causing memory fragmentation in the process. Meanwhile the arena allocator would be at least as efficient as using an array of 3d vectors.

> How is that related to memory allocation costs?

It's a cost that has to be paid when using bump pointer allocation.

> We are talking about the cost of obtaining a chunk of memory. The cost of actually creating an object is allowed to be much higher because constructors are allowed to execute any arbitrary code.

Accessing main memory is about two orders of magnitude slower than accessing L1. For that time you can run a lot of arbitrary code that accesses data in L1 and registers.

> Just think about how expensive it would be to allocate a 3d vector consisting of 3 floats with malloc() 20000 times and then later deallocate it. Nobody is worrying about the cost of writing 3 floats to RAM. Everyone is worrying about the cost of malloc traversing a free list and it causing memory fragmentation in the process. Meanwhile the arena allocator would be at least as efficient as using an array of 3d vectors.

malloc doesn't mandate free lists, other implementations exist. It's not about relative costs. OP claimed bump pointer allocation to be a "no-op" when it's clearly not.