| While it's true that there is nothing magical in C4, pretty much everything you describe here is wrong (as in the exact opposite of what is going on). Specifically: - C4 makes no use of hardware transactional memory. - C4 works great (and natively) on commodity x86-64 cores. - C4 does not emulate anything (not memory architecture, not HTM). - C4's transactional throughput kicks ass. If you want to know how C4 works, all you need to do is read the actual C4 ISMM paper: http://www.azulsystems.com/sites/default/files/images/c4_pap...
<And yes, I'm one of the authors> How you got to your posted conclusions from the link you posted is a mystery to me. Cliff doesn't say anything much about the GC mechanism there: he is talking about lessons learned in designing custom hardware. And yes, Vega had some cool GC-assist features, but they are in no way part of the C4 collector mechanism. Since you raised the bogus, unsubstantiated assertion that "[Zing's] transactional throughput in the vast majority of workloads sucks" based purely on your misinterpretation of what C4 actually does, I feel I must correct that notion. To do that, I'll point out that Zing (with C4) is currently used, in production, to run some of the highest transactional throughput and most latency-sensitive applications in the world. Zing's sustainable throughput (the throughput at which the system still meets the required SLA) is generally dramatically better than other JVMs running on exactly the same modern x86 hardware. Yes, you read that right: Zing gets more production-worthy TPS out of the same x86 hardware. Not less. This is why people who actually need good throughput and good latency tend to graduate to using Zing, and start enjoying time in their hammocks as a result. See http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2014-O... for a happy example. C4 is being used in everything from low latency trading (Algo, HFT, smart routing wire risk) to online retail (think Black Friday workloads) and travel sites. C4 is used for everything from 1GB compute-heavy and messaging workloads to 1TB data-heavy analytics applications. It powers big data workloads. It powers search. It powers Java servers of all kinds, big and small. And none of those are complaining about throughput. Quite the opposite. Peace. |
If you read their published algorithm and are familiar with lock-less algorithms, it's clear that theirs is a transactional memory algorithm. Specifically, their LVB primitive. If this isn't obvious to you, I would recommend reading the seminal transactional memory algorithms from the 1970s and 1980s, including everything written by Hoare. Most of those are available from the ACM library. You particularly need to pay close attention to how wait-free algorithms are achieved.
"Transactional memory" is not a marketing term, nor a synonym for a particular set of CPU instructions. It's a class of lock-free algorithms, especially lock-free, wait-free algorithms. And the C4 collector very clearly fits into that class of algorithms. Its use of page remapping and read/write page protections is precisely how you would emulate strong transactional memory primitives on x86.
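For anyone following along without the paper open, here is a toy Java sketch of what a self-healing read barrier in the spirit of LVB looks like. It is not Azul's implementation. `LvbSketch`, `Ref`, `protectedPages`, and `forwarding` are hypothetical names, a "page" is just an integer, and a set lookup stands in for whatever check the real barrier performs on a loaded reference. The only point is the shape: test the loaded value, fix it up if it points at a page being evacuated, and CAS the corrected value back into the slot it was loaded from.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

// Toy model only: every name here is hypothetical, not taken from C4.
public class LvbSketch {
    // Stand-in for a heap reference: the page it points into plus some data.
    static final class Ref {
        final int page;
        final String payload;
        Ref(int page, String payload) { this.page = page; this.payload = payload; }
    }

    // Pages the imaginary collector is currently evacuating.
    static final Set<Integer> protectedPages = ConcurrentHashMap.newKeySet();
    // Forwarding table from an old reference to its relocated copy.
    static final Map<Ref, Ref> forwarding = new ConcurrentHashMap<>();

    // Barrier applied to a reference loaded from a heap slot.
    static Ref loadValueBarrier(AtomicReference<Ref> slot) {
        Ref seen = slot.get();
        if (seen == null || !protectedPages.contains(seen.page)) {
            return seen; // fast path: nothing to fix
        }
        // Slow path: exactly one thread creates the relocated copy (assumed
        // here to land on page + 1000), everyone else reuses it.
        Ref moved = forwarding.computeIfAbsent(
                seen, old -> new Ref(old.page + 1000, old.payload));
        // Self-heal the slot. If another thread already changed the slot,
        // the CAS simply fails and this load still returns a valid reference.
        slot.compareAndSet(seen, moved);
        return moved;
    }

    public static void main(String[] args) {
        AtomicReference<Ref> slot = new AtomicReference<>(new Ref(7, "hello"));
        protectedPages.add(7); // collector starts evacuating page 7
        Ref first = loadValueBarrier(slot);  // takes the slow path, heals slot
        Ref second = loadValueBarrier(slot); // takes the fast path
        System.out.println("same object on both loads: " + (first == second));
        System.out.println("slot now points at page " + slot.get().page);
    }
}
```

Whether you want to call a compare-and-swap retry of this kind "transactional memory" is exactly what is being argued in this thread; the sketch takes no side on that.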
I think this terse quotation (from their own research paper) sums up the relationship between the Vega hardware and the Linux software-based implementation:
"Azul has created commercial implementations of the C4 algorithm on three successive generations of its custom Vega hardware (custom processor instruction set, chip, system, and OS), as well on modern X86 hardware. While the algorithmic details are virtually identical between the platforms, the implementation of the LVB semantics varies significantly due to differences in available instruction sets, CPU features, and OS support capabilities." (http://www.azulsystems.com/sites/default/files/images/c4_pap...)
Regarding the performance of C4, the reason Azul doesn't publish TPC benchmarks is that there's no avoiding the immense costs of their page mapping hacks. From the paper above: "the garbage collector needs to sustain a page remapping at a rate that is 100x as high as the sustained object allocation rate for comfortable operation."
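To put a rough number on that 100x figure, here is the back-of-the-envelope arithmetic in Java. The 1 GiB/s allocation rate and the two page sizes (2 MiB and 4 KiB) are assumptions picked for illustration, not figures from the paper, and `RemapRate` and `pagesPerSecond` are made-up names.

```java
// Back-of-the-envelope sketch; the inputs below are illustrative assumptions.
public class RemapRate {
    // Pages that must be remapped per second if the remap rate (in bytes)
    // has to be `factor` times the allocation rate.
    static long pagesPerSecond(long allocBytesPerSec, long factor, long pageBytes) {
        return allocBytesPerSec * factor / pageBytes;
    }

    public static void main(String[] args) {
        long allocRate = 1L << 30;  // assumed: 1 GiB/s sustained allocation
        long factor = 100;          // the 100x figure quoted from the paper
        long largePage = 2L << 20;  // assumed: 2 MiB pages
        long smallPage = 4L << 10;  // assumed: 4 KiB pages

        System.out.println("2 MiB pages/s: " + pagesPerSecond(allocRate, factor, largePage));
        System.out.println("4 KiB pages/s: " + pagesPerSecond(allocRate, factor, smallPage));
    }
}
```

Under those assumptions the requirement works out to 51,200 remaps per second with 2 MiB pages versus 26,214,400 per second with 4 KiB pages, which is why the page granularity matters so much to how expensive this is.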
Page remapping is insanely expensive at the micro-granularity needed. They mitigate the cost by batching requests, but it's still significant. Furthermore, they must use atomic reads and writes for internal pointers. Those are cache killers.
I never said C4 can't be faster for particular workloads. Obviously for workloads sensitive to latency a pauseless collector can be faster overall. But as a general matter, those workloads are not in the majority. Ergo, for the majority of workloads C4 will not be faster, at least not on commodity hardware architectures.
You can continue to believe the hype, and believe that Azul possesses some sort of magical fairy dust, using techniques entirely beyond the comprehension of mere mortals. Or you can read about and learn how it _actually_ works. Their algorithm and implementations are all laudable and significant achievements. But there's nothing magical or secret about them.