Hacker News new | ask | show | jobs
by dan-robertson 543 days ago
Re VM warmup, see https://tratt.net/laurie/blog/2022/more_evidence_for_problem... and the linked earlier research for some interesting discussion. Roughly, there is a belief when benchmarking that one can work around not having the most-optimised JIT-compiled version by running your benchmark a number of times and then throwing away the result before doing ‘real’ runs. But it turns out that:

(a) sometimes the jit doesn’t run

(b) sometimes it makes performance worse

(c) sometimes you don’t even get to a steady state with performance

(d) and obviously in the real world you may not end up with the same jitted version that you get in your benchmarks

3 comments

In general, this isn't even a JS problem, or a JIT problem. You have similar issues even in a lower-level language like C++: branch prediction, cache warming, heck, even power state transitions if you're using AVX512 instructions on an older CPU. Stop-the-world GC causing pauses? Variation in memory management exists in C, too -- malloc and free are not fixed-cost, especially under high churn.

Benchmarks can be a useful tool, but they should not be mistaken for real-world performance.

I think that sort of thing is a bit different because you can have more control over it. If you’re serious about benchmarking or running particularly performance sensitive code, you’ll be able to get reasonably consistent benchmark results run-to-run, and you’ll have a big checklist of things like hugepages, pgo, tickless mode, writing code a specific way, and so on to get good and consistent performance. I think you end up being less at the mercy of choices made by the VM.
The amount of control you have varies in a continuum between hand-written assembly to SQL queries. But there isn't really a difference of kind here, it's just a continuum.

If there's anything unique about Javascript is that has an unusually high rate of "unpredictability" / "abstraction level". But again, it has pretty normal values of both of those, just the relation is away from the norm.

I’m quite confused by your comment. I think this subthread is about reasons one might see variance on modern hardware in your ‘most control’ end of the continuum.

The start of the thread was about some ways where VM warmup (and benchmarking) may behave differently from how many reasonably experienced people expect.

I claim it’s reasonably different because one cannot tame parts of the VM warmup issues whereas one can tame many of the sources of variability one sees in high-performance systems outside of VMs, eg by cutting out the OS/scheduler, by disabling power saving and properly cooling the chip, by using CAT to limit interference with the L3 cache, and so on.

> eg by cutting out the OS/scheduler, by disabling power saving and properly cooling the chip, by using CAT to limit interference with the L3 cache, and so on

That's not very different from targeting a single JS interpreter.

In fact, you get much more predictability by targeting a single JS interpreter than by trying to guess your hardware limitations... Except, of course if you target a single hardware specification.

When we were upgrading to ES6 I was surprised/relieved to find that moving some leaf node code in the call graph to classes from prototypes did help. The common wisdom at the time was that classes were still relatively expensive. But they force strict, which we were using inconsistently, and they flatten the object representation (I discovered these issues in the heap dump, rather than the flame graph). Reducing memory pressure can overcome the cost of otherwise suboptimal code.

OpenTelemetry having not been invented yet, someone implemented server side HAR reports and the data collection on that was a substantial bottleneck. Particularly the original implementation.

Running benchmarks on the old Intel MacBook a previous job gave me was like pulling teeth. Thermal throttling all the time. Anything less than at least a 2x speed up was just noise and I’d have to push my changes to CI/CD to test, which is how our build process sprouted a benchmark pass. And a grafana dashboard showing the trend lines over time.

My wheelhouse is making lots of 4-15% improvements and laptops are no good for those.

Benchmarking that isn't based on communicating the distribution of execution times is fundamentally wrong on almost any platform.
I think Tratt’s work is great, but most of the effects that article highlights seem small enough that I think they’re most relevant to VM implementors measuring their own internal optimizations.

Iirc, the effects on long running benchmarks in that paper are usually < 1%, which is a big deal for runtime optimizations, but typically dwarfed by the differences between two methods you might measure.

Cross cutting concerns run into these sorts of problems. And they can sneak up on you because as you add these calls to your coding conventions, it’s incrementally added in new code and substantial edits to old, so what added a few tenths of a ms at the beginning may be tens of milliseconds a few years later. Someone put me on a trail of this sort last year and I managed to find about 75 ms of improvement (and another 50ms in stupid mistakes adjacent to the search).

And since I didn’t eliminate the logic I just halved the cost, that means we were spending about twice that much. But I did lower the slope of the regression line quite a lot, and I believe enough that new nodeJS versions improve response time faster than it was organically decaying. There were times it took EC2 instance type updates to see forward progress.

I think you might be responding to a different point than the one I made. The 1% I'm referring to is a 1% variation between subsequent runs of the same code. This is a measurement error, and it inhibits your ability to accurately compare the performance of two pieces of code that differ by a very small amount.

Now, it reads like you' think I'm saying you shouldn't care about a method if it's only 1% of your runtime. I definitely don't believe that. Sure, start with the big pieces, but once you've reached the point of diminishing returns, you're often left optimizing methods that individually are very small.

It sounds like you're describing a case where some method starts off taking < 1% of the runtime of your overall program, and grows over time. It's true that if you do an full program benchmark, you might be unable to detect the difference between that method and an alternate implementation (even that's not guaranteed, you can often use statistics to overcome the variance in the runtime).

However, you often still will be able to use micro-benchmarks to measure the difference between implementation A and implementation B, because odds are they differ not by 1% in their own performance, but 10% or 50% or something.

That's why I say that Tratt's work is great, but I think the variance it describes is a modest obstacle to most application developers, even if they're very performance minded.

> Now, it reads like you' think I'm saying you shouldn't care about a method if it's only 1% of your runtime. I definitely don't believe that.

Nor do I. I make a lot of progress down in the 3% code. You have to go file by file and module by module rather than hotspot by hotspot down here, and you have to be very good at refactoring and writing unit/pinning tests to get away with it. But the place I worked where I developed a lot of my theories I kept this up for over two years, making significant perf improvements every quarter before I had to start looking under previously visited rocks. It also taught me the importance of telegraphing competence to your customers. They will put up with a lot of problems if they trust you to solve them soon.

No I’m saying the jitter is a lot more than one percent, and rounding errors and externalities can be multiples of the jitter. Especially for leaf node functions - which are a fruitful target because it turns out your peers mostly don’t have opinions about “clever” optimizations if they’re in leaf node functions, in commits they could always roll back if they get sick of looking at them. Which is both a technical and a social problem.

One of the seminal cases for my leaving the conventional wisdom: I’d clocked a function call as 10% of an expensive route, via the profiler. I noticed that the call count was way off. A little more than twice what I would have guessed. Turned out two functions were making the call with the same input, but they weren’t far apart in the call tree, so I rearranged the code a bit to eliminate the second call, expecting the e2e time to be 5% faster. But it was repeatably 20% faster, which made no sense at all, except for memory pressure.

About five years later I saw the most egregious case of this during a teachable moment when I was trying to explain to coworkers that fixing dumb code >> caching. I eliminated a duplicate database request in sibling function calls, and we saw the entire response time drop by 10x instead of the 2-3x I was expecting.