by rayiner 4637 days ago
The thread test is janky. Most multithreaded allocators optimize for the (common) case that objects are freed by the same thread that creates them. When objects are freed by different threads than the ones that allocated them, some sort of slow path is typically invoked.

Older allocators with per-thread caches used to behave very badly with cross-thread frees, accumulating tons of freed objects in threads that didn't necessarily allocate a lot of objects. Tcmalloc uses a garbage collection process to move those objects back to the central free list.

The test in the article, where one thread does all the allocations and another does all the frees, basically subverts the thread-caching in tcmalloc, and just tests how quickly the garbage collection process can move freed objects from the free()-thread's cache back to the central heap where they can be reused by the malloc()-thread.


I admit that the test stresses some corner cases (or at least cases that allocator designers consider corner cases). That said, malloc has no choice but to support that use case.

One use case for this pattern is message posting with workers: you queue messages that are later dequeued and processed by a different thread. This is an increasingly common pattern in modern programs. In that pattern, the message is allocated in one thread (let's say the main one), then processed and deallocated by another thread.

If your implementation of message allocation is malloc-based, then you will stress the exact same code paths the benchmark is stressing.
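
To make the pattern concrete, here is a rough sketch of what I mean, assuming POSIX threads. All the names (`message`, `queue`, `run_pipeline`) are made up for illustration and the queue is a plain mutex-protected list, not anything from the article's benchmark. The main thread does every malloc() and the worker does every free():

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical message type, for illustration only. */
typedef struct message {
    struct message *next;
    int payload;
} message;

/* Simple mutex-protected FIFO shared by the two threads. */
typedef struct {
    message *head, *tail;
    int closed;              /* set once the producer is done */
    long processed_sum;      /* written only by the worker */
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
} queue;

static void queue_push(queue *q, message *m) {
    pthread_mutex_lock(&q->lock);
    m->next = NULL;
    if (q->tail) q->tail->next = m; else q->head = m;
    q->tail = m;
    pthread_cond_signal(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
}

/* Blocks until a message is available; returns NULL once the queue
   is closed and drained. */
static message *queue_pop(queue *q) {
    pthread_mutex_lock(&q->lock);
    while (!q->head && !q->closed)
        pthread_cond_wait(&q->nonempty, &q->lock);
    message *m = q->head;
    if (m) {
        q->head = m->next;
        if (!q->head) q->tail = NULL;
    }
    pthread_mutex_unlock(&q->lock);
    return m;
}

static void *worker(void *arg) {
    queue *q = arg;
    message *m;
    while ((m = queue_pop(q)) != NULL) {
        q->processed_sum += m->payload;
        free(m);             /* freed by a thread that never allocated it */
    }
    return NULL;
}

/* Main thread allocates `count` messages; the worker frees them all.
   Returns the sum of the payloads the worker saw, or -1 on error. */
long run_pipeline(int count) {
    queue q = { .head = NULL, .tail = NULL, .closed = 0, .processed_sum = 0 };
    pthread_t t;
    long result = -1;

    pthread_mutex_init(&q.lock, NULL);
    pthread_cond_init(&q.nonempty, NULL);
    if (pthread_create(&t, NULL, worker, &q) == 0) {
        for (int i = 1; i <= count; i++) {
            message *m = malloc(sizeof *m);  /* allocated on the main thread */
            if (!m) break;
            m->payload = i;
            queue_push(&q, m);
        }
        pthread_mutex_lock(&q.lock);
        q.closed = 1;
        pthread_cond_signal(&q.nonempty);
        pthread_mutex_unlock(&q.lock);
        pthread_join(t, NULL);
        result = q.processed_sum;
    }
    pthread_cond_destroy(&q.nonempty);
    pthread_mutex_destroy(&q.lock);
    return result;
}
```

Every message here takes the cross-thread free path, which is the same shape as the one-allocator-thread, one-free-thread test. Older toolchains may need `-pthread` to build it.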

You're not wrong that malloc-based message passing causes that load on malloc, but if performance of the message-passing code is important, you'd want to use a ring buffer anyway - cross-CPU or not, malloc is pretty slow.
Clearly, we go back to the initial statement: for specific use cases, we need specific allocators.
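
For what it's worth, a minimal sketch of the ring buffer idea, assuming a single producer and a single consumer and C11 atomics. The names (`ring`, `ring_push`, `ring_pop`, `RING_CAPACITY`) are hypothetical, and messages are copied into preallocated slots so malloc never shows up on the hot path:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_CAPACITY 8   /* must be a power of two for the index mask */

/* Hypothetical fixed-size message, stored by value in the ring. */
typedef struct { int payload; } ring_msg;

typedef struct {
    ring_msg slots[RING_CAPACITY];  /* preallocated once, reused forever */
    atomic_size_t head;             /* next slot to read; written by consumer */
    atomic_size_t tail;             /* next slot to write; written by producer */
} ring;

void ring_init(ring *r) {
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side. Returns false if the ring is full. */
bool ring_push(ring *r, ring_msg m) {
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_CAPACITY)
        return false;
    r->slots[tail & (RING_CAPACITY - 1)] = m;
    /* Release: the slot contents become visible before the new tail. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side. Returns false if the ring is empty. */
bool ring_pop(ring *r, ring_msg *out) {
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;
    *out = r->slots[head & (RING_CAPACITY - 1)];
    /* Release: the slot is handed back to the producer only after the copy. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

The trade-off is the usual one: fixed capacity and fixed message size in exchange for no allocator calls at all, which is a small instance of the "specific allocator for a specific use case" point.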