by NobleExpress
1171 days ago
I will say that you seem to be stating "smearing that over total execution time tends to give better results" without proof as well. I certainly don't think using atomic operations everywhere and converting every pointer read operation into a write operation is efficient. Yes, RC has smaller minimum heap sizes than tracing GC, but garbage collection is fundamentally a time-space tradeoff and RC is highly optimized for memory efficiency. Also, most naive RC implementations don't copy objects, so you get heap fragmentation. Comparing naive RC to tracing GC is non-trivial. In order to have a fair comparison, you'd have to implement naive RC and tracing GC in the same system and then compare them across a set of benchmarks. I personally have never come across a great performance study which compared naive RC to tracing GC.
Sure. Atomic counters are relatively expensive. But in good designs there are very few of them. And they easily show up in hotspots if they’re a problem, at which point you fix your object model. The problem with tracing GC is that you have no way to fix it. Most languages that use RC actually avoid most memory allocations/frees by leveraging value composition instead of referential ownership.
I did actually provide proof, by the way. Apple’s phones use half the RAM of Android phones and are at least as fast, even if you discount better HW. Similarly, any big hyperscaler is unlikely to be using Java for their core performance-critical infrastructure. To me those are pretty clear performance advantages.
> I certainly don't think using atomic operations everywhere and converting every pointer read operation into a write operation is efficient.
I’m unaware of anyone using RC properly who is doing this. You only do this when you need to share ownership, and you should be doing that exceedingly sparingly. Unique ownership with referential sharing is by far the most common pattern. If you’re passing an RC pointer into a function that doesn’t retain ownership beyond its call scope, you’re not using RC properly.
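To make that concrete, here's a minimal C++ sketch of the distinction. `Config`, `read_retries`, and `Worker` are hypothetical names made up for illustration, not from any real codebase. A function that only reads the object takes a plain reference and never touches the counter; only the type that actually keeps the object alive takes a `shared_ptr`.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical type used only for illustration.
struct Config {
    std::string name;
    int retries = 0;
};

// Borrows for the duration of the call: takes a reference,
// so no reference count is read or written.
int read_retries(const Config& cfg) {
    return cfg.retries;
}

// Retains the object beyond the call, so it genuinely shares ownership.
// Taking the shared_ptr by value costs one atomic increment, and the
// move into the member adds no further count traffic.
struct Worker {
    std::shared_ptr<const Config> cfg;
    explicit Worker(std::shared_ptr<const Config> c) : cfg(std::move(c)) {}
};
```

Calling `read_retries(*ptr)` leaves `ptr.use_count()` unchanged, while constructing a `Worker` from `ptr` bumps it by one. The atomic cost is paid where ownership changes, not on every read.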
> Also most naive RC implementations don't copy objects so you get heap fragmentation.
That’s only kind of true. Good allocators seem to mitigate this problem quite effectively (glibc’s is notably not good at this compared with mimalloc and the new tcmalloc). But sure, that is kind of a problem. It can be mitigated, though, by optimizing your allocation patterns once you know that’s the problem (noticing it is by far the hardest bit). And because it’s more manual, you can customize your allocator.
Look. I’m not disagreeing that the developer benefits are significant. I’m just saying that good memory management (made fairly easy in Rust) is always going to outperform tracing GC the same way optimized assembly will outperform the compiler. It’s possible that a tracing GC can provide better performance out of the gate with minimal optimization, versus needing to spend more time optimizing your allocation strategies if you make design mistakes. But remember: a tracing garbage collector still needs to do atomic reads of data structures, which potentially requires cross-CPU shootdowns. And you still generally need to stop the world (I think that may not be true in some advanced Java GC algorithms, but that’s the exception rather than the rule, and you trade off even lower throughput). And I can’t belabor this point enough - languages with RC use it rarely, as shared ownership is rarely needed and you can typically minimize where you use it.
Let me give you a real-world example. When I was working on indoor positioning in iOS, I ported our original Java codebase verbatim, using shared_ptr for almost every Java allocation across the board where I might even potentially be sharing ownership, as I wanted to start with safety and optimize later. The initial version was already faster than the equivalent Java code (not surprising, since C++ will outperform due to no autoboxing, at least at the time). When I then got rid of shared_ptr in the particle filter, which is a core hot path, it only showed a 5-10% improvement in perf (using Accelerate for the core linear algebra code was way more impactful). The vast majority of that improvement actually came from the fact that all the particles were now living contiguously within the vector, rather than from removing the overhead of the atomic counting. Just saying. People really overestimate the cost of RC, because RC is rarely needed in the first place, and when it is, ownership shouldn’t be being modified in your hot path. When I worked at Oculus on Link, we used shared_ptr liberally in places because, again, ownership is actually rarely shared - most allocations are unique_ptr.
Edit: note that I’m explicitly distinguishing RC (single-threaded) from ARC (atomic, multi-threaded RC). Confusingly, ARC stands for automatic RC in Apple land, although it’s atomic there too. Automatic RC is trickier, but again, as Swift and ObjC demonstrate, it’s generally good enough without any serious performance implications (throughput or otherwise).
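For illustration only (this is not the actual positioning code, and `Particle` and both function names are made up), the two layouts look roughly like this in C++. In the first, every particle is its own heap allocation with a control block, and the loop chases a pointer per element. In the second, the particles sit back to back in the vector's buffer.

```cpp
#include <memory>
#include <vector>

// Hypothetical particle type; the fields are assumptions for the sketch.
struct Particle {
    double x;
    double y;
    double weight;
};

// "Ported from Java" layout: one heap object per particle.
// Iterating by const reference does not touch the reference counts,
// but each element is a separate allocation somewhere on the heap.
double total_weight_shared(const std::vector<std::shared_ptr<Particle>>& ps) {
    double sum = 0.0;
    for (const auto& p : ps) {
        sum += p->weight;
    }
    return sum;
}

// Value layout: particles are stored contiguously in the vector's buffer,
// so the loop walks memory linearly.
double total_weight_contiguous(const std::vector<Particle>& ps) {
    double sum = 0.0;
    for (const Particle& p : ps) {
        sum += p.weight;
    }
    return sum;
}
```

Note that the loop over the `shared_ptr` vector does no atomic work at all, which fits the point above: the difference between the two versions is mostly memory layout, not counting.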