by vardump
4006 days ago
Well, as an embedded and kernel driver developer I do see the advantages of manual memory management, but one should understand that manual memory allocation and freeing are very expensive and unpredictable operations. Very, very far from "zero overhead". There's usually no way you can afford to allocate memory while processing an interrupt request! It's simply too unpredictably slow. (Not to mention that many of the synchronization mechanisms memory management usually requires, like mutexes, are simply not feasible at IRQ level. No process switching is possible, so a deadlock occurs.)
But some GC schemes could be fast enough even for an IRQ handler, if memory allocation is just something simple like an atomic add to a top-of-heap pointer, as long as non-interrupt-level routines, potentially running on another core, have enough time to clean up the garbage.
Manual memory management seems to be at the end of its road when it comes to high core count systems, with tens or hundreds of CPU cores. Allocation and application-side object lifetime synchronization and management will simply saturate any inter-core communication mechanism, limiting scalability. GC should be able to get around that limitation. At that scale, you could already dedicate one or more cores just to cleaning up garbage.
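To make the "atomic add to top of heap pointer" idea concrete, here is a minimal C++ sketch. It is an illustration under stated assumptions, not any particular collector's allocator: the `BumpArena` class, its 16-byte alignment and the `reset()` hook are all hypothetical, and reclaiming the space is left entirely to whatever collector runs outside the interrupt path.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical bump-pointer arena; the name, layout and sizes are illustrative.
class BumpArena {
public:
    static constexpr std::size_t kAlign = 16;  // assumed alignment for every block

    BumpArena(std::uint8_t* base, std::size_t size)
        : base_(base), size_(size), top_(0) {
        // The caller must hand in a buffer that is already kAlign-aligned.
        assert(reinterpret_cast<std::uintptr_t>(base) % kAlign == 0);
    }

    // Allocation is one atomic add on the top-of-heap offset.
    // No lock is taken, so the call cannot block on another thread.
    void* allocate(std::size_t bytes) {
        if (bytes > size_) return nullptr;  // also avoids overflow when rounding
        std::size_t rounded = (bytes + kAlign - 1) & ~(kAlign - 1);
        std::size_t offset = top_.fetch_add(rounded, std::memory_order_relaxed);
        // A failed request still advances top_; a real design would have to
        // bound that. Here the arena simply stays exhausted until reset().
        if (offset > size_ || rounded > size_ - offset) return nullptr;
        return base_ + offset;
    }

    // Stand-in for the collector: callable only once nothing in the arena
    // is reachable any more. A real GC would do far more than this.
    void reset() { top_.store(0, std::memory_order_release); }

private:
    std::uint8_t* base_;
    std::size_t size_;
    std::atomic<std::size_t> top_;
};
```

Whether the single `fetch_add` is cheap enough for a given IRQ handler depends on the target: on some architectures it compiles to a retry loop rather than one instruction.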
https://fossies.org/dox/glibc-2.21/malloc_8c_source.html#l02...
Also see this benchmark comparing different, faster allocators: http://locklessinc.com/benchmarks_allocator.shtml
The extra work doesn't end at allocation. When you actually implement a concurrent system, you'll usually end up with corner cases at object lifetime changes, which you need to synchronize. If memory is only reclaimed when there are no more references to it, this extra synchronization step can be avoided. If you can't rely on that, you'll probably end up doing synchronization of your own, such as (atomic) reference counting.
Synchronization is very expensive and can quickly become the performance bottleneck for the whole system. On modern x86, you can do 5-20k floating point operations in the time one contended atomic sync op takes. A reference count increase or decrease is one sync op. A simple mutex needs two of those.
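For reference, this is roughly what the "one sync op" per reference count change looks like in C++. It is a minimal sketch: `RefCounted`, `retain` and `release` are hypothetical names, and the memory orderings are one common choice rather than the only correct one.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical intrusive reference count; an object starts with one owner.
struct RefCounted {
    std::atomic<std::size_t> refs{1};
};

// Taking a reference: one atomic read-modify-write on the shared counter.
inline void retain(RefCounted& obj) {
    obj.refs.fetch_add(1, std::memory_order_relaxed);
}

// Dropping a reference: again one atomic read-modify-write.
// Returns true if the caller released the last reference and must now
// destroy the object.
inline bool release(RefCounted& obj) {
    std::size_t prev = obj.refs.fetch_sub(1, std::memory_order_acq_rel);
    assert(prev > 0);  // releasing more often than retaining is a bug
    return prev == 1;
}
```

Every `retain` and `release` from a different core has to pull the counter's cache line over first, which is where the contention cost comes from.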
The more CPU cores you have, the more synchronization (cache coherence) traffic gets broadcast to all cores.
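A small C++ sketch of the two access patterns, in case you want to time them on your own hardware. Everything here is an assumption for illustration: the function names are made up, 64 bytes is an assumed cache line size, and C++17 is assumed so that `std::vector` honours the over-alignment. No timings are given because they vary a lot between machines.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// All threads increment one shared atomic. Every increment needs exclusive
// ownership of the same cache line, so the line bounces between cores.
std::uint64_t count_shared(unsigned threads, std::uint64_t iters) {
    std::atomic<std::uint64_t> total{0};
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back([&total, iters] {
            for (std::uint64_t i = 0; i < iters; ++i)
                total.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : pool) th.join();
    return total.load();
}

// One counter per thread, padded to an assumed 64-byte cache line so that
// neighbouring counters never share a line.
struct alignas(64) PaddedCounter {
    std::uint64_t value = 0;
};

// Each thread touches only its own slot; the slots are summed after join(),
// so no atomic operations are needed on the hot path.
std::uint64_t count_sharded(unsigned threads, std::uint64_t iters) {
    std::vector<PaddedCounter> slots(threads);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back([&slots, t, iters] {
            for (std::uint64_t i = 0; i < iters; ++i)
                slots[t].value += 1;
        });
    for (auto& th : pool) th.join();
    std::uint64_t sum = 0;
    for (const auto& s : slots) sum += s.value;
    return sum;
}
```

Both functions return the same total; only the amount of inter-core traffic differs. An optimizing compiler may collapse the sharded loop entirely, so treat this as a picture of the data layout rather than a ready-made benchmark.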
Words like "JIT" and "GC" seem to cause knee-jerk reactions in some developers. Likewise for manual memory management. It's not so black and white. There'll always be trade-offs. I usually write low level (firmware and kernel driver) and high performance code. C/C++/SIMD. Code that might need to react in under a microsecond.
My message is just: please be more open-minded.
Analyze where your code spends its execution time. You might be surprised how much of it is spent in things like C++ streams, xprintfs and memory allocation. Unfortunately, inter-core synchronization is more insidious. It's not visible in benchmarks on small systems. Often you only start to see hints of this problem when actually running on more cores. Enough of them, and that's all your code is doing.