by blacklion 539 days ago
Very strange take on "JIT introduces a lot of error into the result". I'm from the JVM/Java world, which is a JITted VM too, and in our world the question is: why would you want to benchmark interpreted code at all!?

Only final-stage, fully JITted, profile-optimized code is what matters.

Short-lived interpreted / level-1 JITted code is not interesting at all from a benchmarking perspective, because it will be compiled fast enough that it doesn't matter in the grand scheme of things.


JIT can be very unpredictable. I've seen cases on the JVM where running the exact same benchmark in the same VM twice made the second run 2x slower than the first, cases where having run one benchmark before another made the latter 5x slower, and similar.

Sure, if you make a 100% consistent environment of a VM running just the single microbenchmark you may get a consistent result on one system, but is a consistent result in any way meaningful if it may be a massive factor away from what you'd get in a real environment? And even then I've had cases of like 1.5x-2x differences for the exact same benchmark run-to-run.

Granted, this may be less of a benchmarking issue, more just a JIT performance issue, but it's nevertheless also a benchmarking issue.

Also, for JS, in the browser specifically, pre-JIT performance is actually a pretty meaningful measurement, as each website load starts anew.
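
To put a number on the run-to-run variation described above, one option is to time the same workload several times and report the ratio of the slowest run to the fastest. This is only a minimal sketch: `time_once`, `run_to_run_spread`, and `workload` are hypothetical names, and Python is used purely for brevity, so it shows the shape of the measurement rather than any JIT behavior. Repeating inside one process corresponds to the "same VM twice" case; comparing fresh VMs would mean launching a separate process per run.

```python
import statistics
import time


def time_once(fn, iterations):
    """Return wall-clock seconds for `iterations` calls of `fn`."""
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return time.perf_counter() - start


def run_to_run_spread(fn, runs=5, iterations=10_000):
    """Time the same workload `runs` times and summarize how much the runs disagree."""
    samples = [time_once(fn, iterations) for _ in range(runs)]
    return {
        "samples": samples,
        "median": statistics.median(samples),
        # Near 1.0 means the runs agree; 2.0 means the slowest run took twice as long.
        "slowest_over_fastest": max(samples) / min(samples),
    }


def workload():
    # Hypothetical stand-in for the code under test.
    sum(i * i for i in range(100))


if __name__ == "__main__":
    report = run_to_run_spread(workload)
    print(f"median {report['median']:.4f}s, "
          f"spread {report['slowest_over_fastest']:.2f}x")
```

A spread well above 1.0 on an otherwise idle machine is a hint that a single run is not a trustworthy number.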

How long did you run the benchmark if you got such large variation?

For simple methods I usually run the benchmarked method 100k times; 10k is the minimum for full JIT.

For large programs I have noticed the performance keeps getting better for the first 24 hours, after which I take a profiling dump.
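
The warm-up habit described above can be sketched as a harness that discards a fixed number of calls before timing anything. This is an illustrative sketch only: `benchmark` and its parameters are hypothetical, the default of 10,000 warm-up calls simply mirrors the figure quoted in the comment, and Python is used for brevity, so only the structure carries over to a JIT-compiled runtime.

```python
import statistics
import time


def benchmark(fn, warmup_calls=10_000, measured_batches=20, calls_per_batch=1_000):
    """Discard warm-up calls, then time several batches and report per-call times."""
    # Warm-up phase: results are thrown away. On a JIT-compiled runtime this is
    # where the method would be expected to reach its final compilation tier.
    for _ in range(warmup_calls):
        fn()

    per_call = []
    for _ in range(measured_batches):
        start = time.perf_counter()
        for _ in range(calls_per_batch):
            fn()
        per_call.append((time.perf_counter() - start) / calls_per_batch)

    return {
        "median_s": statistics.median(per_call),
        "min_s": min(per_call),
        "max_s": max(per_call),
    }


if __name__ == "__main__":
    result = benchmark(lambda: sum(range(50)))
    print(f"median {result['median_s'] * 1e9:.0f} ns/call "
          f"(min {result['min_s'] * 1e9:.0f}, max {result['max_s'] * 1e9:.0f})")
```

Reporting min and max alongside the median makes it visible when the measured batches themselves have not settled, which would suggest the warm-up count was too low for that workload.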

Most of the simple benches I do run for ~1 second. The order-dependent things definitely were reproducible (something along the lines of rerunning resulting in some rare virtual method case finally being invoked enough times/with enough cases to heavily penalize the vastly more frequent case). And the case of very different results was C2 deciding to compile the code differently (looking at the assembly was problematic, as adding printassembly or whatever skewed which case it took); the result stayed stable for tens of seconds after the first ~second iirc (though, granted, it was preview jdk.incubator.vector code).

> I'm from the JVM/Java world, which is a JITted VM too, and in our world the question is: why would you want to benchmark interpreted code at all!?

Java gives you exceptional control over the JVM, allowing you to create really good benchmark harnesses. That is not the case with JavaScript today, and the proliferation of different runtimes makes it even harder. To the best of my knowledge there is no JMH equivalent for JavaScript today.

When JITing Java, the main profiling inputs are for call devirtualization. That has a lot of randomness, but it's confined to just those callsites where the JIT would need profiling to devirtualize.

When JITing JavaScript, every single fundamental operation has profiling. Adding stuff has multiple bits of profiling. Every field access. Every array access. Like, basically everything, including also callsites. And without that profiling, the JS JIT can't do squat, so it depends entirely on that profiling. So the randomness due to profiling has a much more extreme effect on what the compiler can even do.
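
A toy model may help show why profiling makes results order-dependent. The sketch below is not any real VM's data structure: `CallSiteProfile`, `run`, and the cutoff of 4 are hypothetical, and real engines choose their own limits and record far more than type names. It only shows that the same code can end up with a different profile, and so a different compilation decision, depending on what ran through the site earlier.

```python
class CallSiteProfile:
    """Toy model of per-site type feedback (not any real VM's data structure)."""

    MAX_POLYMORPHIC = 4  # assumed cutoff; real engines pick their own limits

    def __init__(self):
        self.seen_types = {}

    def record(self, value):
        """Count one observation of the value's type at this site."""
        name = type(value).__name__
        self.seen_types[name] = self.seen_types.get(name, 0) + 1

    def state(self):
        """Classify the site by how many distinct types it has seen."""
        n = len(self.seen_types)
        if n == 0:
            return "uninitialized"
        if n == 1:
            return "monomorphic"
        if n <= self.MAX_POLYMORPHIC:
            return "polymorphic"
        return "megamorphic"


def run(values, profile):
    """Feed values through the site and return the resulting profile state."""
    for v in values:
        profile.record(v)
    return profile.state()


if __name__ == "__main__":
    fresh = CallSiteProfile()
    print(run([1, 2, 3], fresh))    # monomorphic

    shared = CallSiteProfile()
    run([1.5, "a"], shared)         # an earlier benchmark touched the same site
    print(run([1, 2, 3], shared))   # polymorphic
```

In this model the integer-only benchmark sees a monomorphic site when it runs alone and a polymorphic one when something else ran first, which is the kind of effect that can make benchmark order matter.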

JavaScript code is often short-lived and doesn't have enough time to wait for the JIT to warm up.

> Short-lived interpreted / level-1 JITted code is not interesting at all from a benchmarking perspective, because it will be compiled fast enough that it doesn't matter in the grand scheme of things.

This is true for servers but extremely not true for client-side GUI applications and web apps. Often, the entire process of [ user starts app > user performs a few tasks > user exits app ] can be done in a second. Often, the JIT never has a chance to warm up.

If it is done in a literal second, why would you benchmark it?

In such a case you need a "binary" benchmark: does the user need to wait or not? You don't need fancy graphs, percentiles, etc.

And in such a case your worst enemy is not the JIT but the variance of users' hardware, from an old Atom netbook to a high-end workstation with tens of 5 GHz cores. Same for RAM and screen size.

> If it is done in a literal second, why would you benchmark it?

The difference between one second and two seconds can be the difference between a happy user and an unhappy user.

> You don't need fancy graphs, percentiles, etc.

You don't need those to tell you if your app is slow, you need them to tell you why your app is slow. The point of a profiler isn't to identify the existence of a performance problem. You should know you have a performance problem before you ever bother to start your profiler. The point is to give you enough information so that you can solve your performance problem.

> your worst enemy is not the JIT but the variance of users' hardware, from an old Atom netbook to a high-end workstation with tens of 5 GHz cores. Same for RAM and screen size.

Yes, these are real problems that client-side developers have to deal with. It's hard.

Agreed, comparing functions in isolation can give you drastically different results from the real world, where your application can have vastly different memory access patterns.

Does anyone know how well the JIT/cache in the browser works, e.g. how useful it is to profile JIT'd vs non-JIT'd code and what those different scenarios might represent in practice? For example, is it just JIT-ing as the page loads/executes, or are there cached functions that persist across page loads, etc.?