Great question! I actually just touched on this in another thread that went up right around the same time you asked this. It is clearly the next big frontier!
The short answer is: It's something I'm actively thinking about, but instrumenting micro-level events (like ZGC's load barriers or G1's write barriers) directly inside application threads without destroying throughput (or creating observer effects invalidating the measurements) is incredibly difficult.
> instrumenting micro-level events (like ZGC's load barriers or G1's write barriers) directly inside application threads without destroying throughput (or creating observer effects invalidating the measurements) is incredibly difficult
I've used a sampling profiler with success to find lock contention in heavily multithreaded code, but I guess there are some details that make it not viable for this?
That is spot on. Effectively disabling GC to establish a baseline is exactly the methodology used in the Blackburn & Hosking paper [1] I referenced.
In general, for a production JVM like HotSpot, the implicit cost comes largely from the barriers (instructions baked directly into the application code). So even if we disable GC cycles, those barriers are still executing.
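To make "baked directly into the application code" concrete, here is a minimal sketch of a generic card-marking write barrier, modeled in plain Java. It is a conceptual illustration only, not HotSpot's actual barrier code: `BarrierSketch`, `Node`, the simulated `address` field, the card size, and the heap size are all assumptions made up for the example.

```java
/** Conceptual model of a card-marking write barrier; NOT HotSpot's real implementation. */
public class BarrierSketch {
    // Hypothetical sizes, chosen only for illustration.
    static final int CARD_SHIFT = 9;              // 512-byte cards
    static final int HEAP_BYTES = 1 << 20;        // 1 MiB simulated heap
    static final byte DIRTY = 1;
    static final byte[] cardTable = new byte[HEAP_BYTES >> CARD_SHIFT];

    static final class Node {
        final int address;                        // simulated heap address (assumption)
        Node next;
        Node(int address) { this.address = address; }
    }

    /** What the programmer writes: a plain reference store. */
    static void storePlain(Node holder, Node value) {
        holder.next = value;
    }

    /** What barrier-instrumented code conceptually does for the same store. */
    static void storeWithBarrier(Node holder, Node value) {
        holder.next = value;
        // Extra work on every reference store, whether or not a GC cycle ever runs.
        cardTable[holder.address >>> CARD_SHIFT] = DIRTY;
    }

    static int dirtyCards() {
        int count = 0;
        for (byte b : cardTable) {
            if (b == DIRTY) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        Node a = new Node(0);
        Node b = new Node(4096);
        storeWithBarrier(a, b);
        storeWithBarrier(b, a);
        System.out.println("dirty cards: " + dirtyCards());
    }
}
```

The point of the sketch is that the card-table write happens in the mutator on every reference store, so suppressing collection cycles does nothing to remove it.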
If we were to remove barriers during execution, maintaining correctness would become the hard part. We would need a way to ensure we don't mark a live (reachable) object as dead the moment we re-enable the collector.
Would running an application with the chosen GC, subtracting the GC time reported by the methods you introduced, and then comparing with an Epsilon-based run be a good estimate of barrier overhead?
That is a creative idea, but unfortunately, Epsilon changes the execution profile too much to act as a clean baseline for barrier costs.
One huge issue is spatial locality. Epsilon never reclaims, whereas other GCs reclaim and reuse memory blocks. This means their L2/L3 cache hit rates will be fundamentally different.
If you compare them, the delta wouldn't just be the barrier overhead; it would be the barrier overhead mixed with completely different CPU cache behaviors, memory layout etc. The GC is a complex feedback loop, so results from Epsilon are rarely directly transferable to a "real" system.
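For reference, the proposed estimate boils down to the arithmetic below. This is just a sketch of the subtraction being discussed; `NaiveBarrierEstimate`, its parameter names, and the numbers in `main` are made up for illustration and are not measurements.

```java
/** Sketch of the naive "subtract GC time, compare with Epsilon" estimate. */
public class NaiveBarrierEstimate {
    /**
     * (wall time with the real GC - reported GC time) - wall time under Epsilon.
     * The result is NOT pure barrier cost: it also absorbs cache, memory-layout,
     * and other differences between the two runs.
     */
    static double naiveEstimateMs(double wallWithGcMs, double reportedGcMs, double wallEpsilonMs) {
        double mutatorWithGcMs = wallWithGcMs - reportedGcMs;
        return mutatorWithGcMs - wallEpsilonMs;
    }

    public static void main(String[] args) {
        // Hypothetical inputs, purely illustrative.
        double estimate = naiveEstimateMs(10_500.0, 400.0, 9_800.0);
        System.out.println("naive estimate (ms): " + estimate);
    }
}
```

Whatever number comes out, there is no way to tell from it alone how much is barrier work and how much is the locality difference.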