by kstrauser 370 days ago
I’ve wondered about this before but never when around people who might know. From my outsider view, jemalloc looked like a strict improvement over glibc’s malloc, according to all the benchmarks I’d seen when the subject came up. So, why isn’t it the default allocator?
5 comments

It is on FreeBSD. :P Change your malloc, change your life? May as well change your libc while you're there and use FreeBSD libc too, and that'll be easier if you also adopt the FreeBSD kernel.

I will say, the Facebook people were very excited to share jemalloc with us when they acquired my employer, but we were using FreeBSD so we already had it and thought it was normal. :)

Disclaimer: I'm not an allocator engineer, this is just an anecdote.

A while back, I had a conversation with an engineer who maintained an OS allocator, and their claim was that custom allocators tend to make one process's memory allocation faster at the expense of the rest of the system. System allocators are less able to make allocation fair holistically, because one process isn't following the same patterns as the rest.

Which is why you see it recommended so frequently with services, where there is generally one process that you want to get preferential treatment over everything else.

The only way I can see that this would be true is if a custom allocator is worse about unmapping unused memory than the system allocator. After all, processes aren't sharing one heap; it's not like fragmentation in one process's address space is visible outside of that process... The only aspect of one process's memory allocation that's visible to other processes is, "that process uses N pages worth of resident memory so there's less available for me". But one of the common criticisms against glibc is that it's often really bad at unmapping its pages, so I'd think that most custom allocators are nicer to the system?

I would be interested in hearing their thoughts directly. I'm also not an allocator engineer, and someone who maintains an OS allocator probably knows wayyy more about this stuff than me. I'm sure there's some missing nuance or context which would've made it make sense.
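To make the "N pages worth of resident memory" point concrete, here is a minimal sketch that maps some anonymous memory, dirties it, and unmaps it again while watching the process's resident set size. It assumes Linux (it reads `/proc/self/statm`, and returns `None` elsewhere); the helper names `resident_bytes` and `touch_and_release` are made up for illustration and are not part of any allocator.

```python
import mmap
import os


def resident_bytes():
    """Resident set size in bytes, or None where /proc/self/statm is unavailable."""
    try:
        with open("/proc/self/statm") as f:
            resident_pages = int(f.read().split()[1])  # second field is resident pages
        return resident_pages * os.sysconf("SC_PAGE_SIZE")
    except (OSError, ValueError, IndexError, AttributeError):
        return None


def touch_and_release(size):
    """Map `size` bytes anonymously, dirty every page, then unmap.

    Returns (rss_before, rss_while_mapped, rss_after_unmap).
    """
    before = resident_bytes()
    region = mmap.mmap(-1, size)          # anonymous mapping, not backed by a file
    for offset in range(0, size, mmap.PAGESIZE):
        region[offset] = 1                # first write makes the kernel back the page
    during = resident_bytes()
    region.close()                        # unmaps the region; pages return to the kernel
    after = resident_bytes()
    return before, during, after
```

On a Linux box, `touch_and_release(64 << 20)` should show the middle value sitting well above the other two. An allocator that caches freed pages instead of unmapping them would, in effect, keep the third value high, and that is the only part other processes can feel.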

I don't think that's really a position that can be defended. Both jemalloc and tcmalloc evolved and were refined in antagonistic multitenant environments without one overwhelming application. They are optimal for that exact thing.
> Both jemalloc and tcmalloc evolved and were refined in antagonistic multitenant environments without one overwhelming application. They are optimal for that exact thing.

They were mostly optimised on Facebook/Google server-side systems, which were likely one application per VM, no? (Unlike desktop usage where users want several applications to run cooperatively). Firefox is a different case but apparently mainline jemalloc never matched Firefox jemalloc, and even then it's entirely plausible that Firefox benefitted from a "selfish" allocator.

Google runs dozens to hundreds of unrelated workloads in lightweight containers on a single machine, in "borg". Facebook has a thing called "tupperware" with the same property.
I think Tupperware was rebranded to Twine sometime about 6-7 years ago.
It's possible that they were referring to something specific about their platform and its system allocator, but like I said it was an anecdote about one engineer's statement. I just remember thinking it sounded fair at the time.
The “system” allocator is managing memory within a process boundary. The kernel is responsible for managing it across processes. Claiming that a user space allocator is greedily inefficient is voodoo reasoning that suggests the person making the claim has a poor grasp of architecture.
There are shared resources involved though; for example, one process can cause a lot of traffic in khugepaged. However, I would point out that this is an endemic risk of Linux's overall architecture. Any process can cause chaos by dirtying pages, or otherwise triggering reclaim.
That’s generally true of any allocator. Assuming glibc’s behavior would help mitigate this is a stretch: it's not something kernel engineers design around, nor something the glibc allocator is trying to achieve as a design goal.
For context, the "allocator engineer" I was talking to was a kernel engineer - they have an extremely solid grasp of their platform's architecture.

The whole advantage of being the platform's system allocator is that you can have a tighter relationship between the library function and the kernel implementation.

I’m not generally aware of any system allocator that’s written hand in glove with the kernel’s allocator or somehow interops better for overall system efficiency at the cost of behavior in-app. Care to provide an example?
The "greedy" part is likely not releasing pages back to the OS in a timely manner.
That seems odd though, seeing as this is one of the main criticisms of glibc's allocator.
These allocators often have higher startup cost. They are designed for high performance in the steady state, but they can be worse in workloads that start a million short-lived processes in the unix style.
Oh, interesting. If that's the case, I can see why that'd be a bummer for short-lived command line tools. "Makes ls run 10x slower" would not be well received. OTOH, FreeBSD uses it by default, and it's not known for being a sluggish OS.
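One rough way to check the startup-cost claim is to time many short-lived runs of the same command with and without a preloaded allocator. A minimal sketch, assuming a system whose dynamic loader honours `LD_PRELOAD` (glibc's does); `time_spawns` is a hypothetical helper, and any library path you pass as `preload` is an assumption about your own install, not something from this thread.

```python
import os
import subprocess
import sys
import time


def time_spawns(argv, runs=50, preload=None):
    """Mean wall-clock seconds per run of `argv`, spawned `runs` times in sequence."""
    env = dict(os.environ)
    if preload is not None:
        # LD_PRELOAD is honoured by the glibc dynamic loader; the library path
        # passed in is an assumption about where an allocator is installed.
        env["LD_PRELOAD"] = preload
    start = time.perf_counter()
    for _ in range(runs):
        subprocess.run(argv, env=env, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return (time.perf_counter() - start) / runs


if __name__ == "__main__":
    # Baseline only: spawn this interpreter doing nothing, a few times.
    print(time_spawns([sys.executable, "-c", "pass"], runs=10))
```

To compare allocators you would call it twice on something small like `["ls", "/"]`, once with `preload=None` and once with the path to your jemalloc or tcmalloc shared object. Wall-clock timing of process spawns is noisy, so treat small differences with suspicion.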
For a long time, one of the major problems with alternate allocators is that they would never return free memory back to the OS, just keep the dirty pages in the process. This did eventually change, but it remains a strong indicator of different priorities.

There's also the fact that ... a lot of processes only ever have a single thread, or at most have a few background threads that do very little of interest. So all these "multi-threading-first allocators" aren't actually buying anything of value, and they do have a lot of overhead.

Semi-related: one thing that most people never think about: it is exactly the same amount of work for the kernel to zero a page of memory (in preparation for a future mmap) as for a userland process to zero it out (for its own internal reuse)

> Semi-related: one thing that most people never think about: it is exactly the same amount of work for the kernel to zero a page of memory (in preparation for a future mmap) as for a userland process to zero it out (for its own internal reuse)

Possibly more work since the kernel can't use SIMD

Why is that? Doesn't Linux use SIMD for the crypto operations?
Allowing SIMD instructions to be used arbitrarily in the kernel actually has a fair penalty to it. I'm not sure what Linux does specifically, but:

When a syscall is made, the kernel has to back up the user mode state of the thread, so it can restore it later.

If any kernel code could use SIMD registers, you'd have to back up and restore those too, and those registers get big. You could easily be looking at adding a 1 KB copy to every syscall, and most of the time it wouldn't be needed.
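For a sense of scale on "those registers get big", here is the back-of-the-envelope arithmetic for the x86-64 vector register files. This is only a sketch: it counts the architectural vector registers and ignores mask registers, x87 state, and the save-area header, so the area a real kernel saves is somewhat larger and depends on the CPU.

```python
# Architectural vector register files on x86-64: (register count, bytes per register).
VECTOR_REGISTERS = {
    "SSE (XMM0-15)": (16, 16),       # 128-bit registers
    "AVX (YMM0-15)": (16, 32),       # 256-bit registers
    "AVX-512 (ZMM0-31)": (32, 64),   # 512-bit registers
}


def state_bytes(count, width):
    """Bytes needed to hold `count` registers of `width` bytes each."""
    return count * width


for name, (count, width) in VECTOR_REGISTERS.items():
    print(f"{name}: {state_bytes(count, width)} bytes")
```

That works out to 256, 512 and 2048 bytes respectively, so the "1 KB" figure is the right order of magnitude, and on AVX-512 hardware it is an underestimate.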

Why is that? Couldn’t there be push_simd()/pop_simd() that the syscall itself uses around its SIMD calls?

If no syscalls use SIMD today, I’d think we’re starting from a safe position.

push_simd/pop_simd exist and are called kernel_fpu_begin/kernel_fpu_end. Their use is practically prohibited in most areas and iiuc not available on all archs, but it's available if needed.
It's not so much that you can't ever use it, it's more that you really shouldn't. It's more expensive, harder to use and rarely worth it. Main users currently are crypto and RAID checksumming.

https://www.kernel.org/doc/html/next/core-api/floating-point...

That’s actually particularly true of alternate allocators and not true for glibc if I recall correctly (it’s much worse at returning memory).
As far as I know there is no technical reason why jemalloc shouldn't be the default allocator. In fact, as pointed out in the article, it IS the default allocator on FreeBSD. My understanding is it is largely political.
Now that I think about it, I could easily imagine it being left out of glibc because it doesn't build on Hurd or something.
> I could easily imagine it being left out of glibc because [...]

... its license is BSD-2-Clause ;)

hence "political"

Huh? BSD-style licenses are fully compatible with the GPL.

The problem is exactly this: Facebook becomes the upstream of a key part of your system.

And Facebook can just walk away from the project. Like it did just now.

They are compatible but that's not the point.

If it were included it would instantly become an LGPL hard fork because of any subsequently added line of code, if not by "virality" of the glibc license, at least because any glibc author code addition would be LGPL, per GNU project policy/ideology.

Also, this would be a hard bar to pass: https://sourceware.org/glibc/wiki/CopyrightFSForDisclaim

As I recall this is what prevented Apple from contributing C blocks† back to upstream GCC.

https://github.com/lloeki/cblocks-clobj

What prevents Apple from working with GPL-style licenses is strict hatred towards code that they can't use without open-sourcing it. So this is what prevents them from contributing to GPL projects: the need to control access to code.

LLVM is OK for them from this point of view: upstream is open but they can maintain and distribute their proprietary fork.