by pizlonator 547 days ago
(I designed JavaScriptCore's optimizing JITs and its garbage collector and a bunch of the runtime. And I often benchmark stuff.)

Here's my advice for how to run benchmarks and be happy with the results.

- Any experiment you perform has the risk of producing an outcome that misleads you. You have to viscerally and spiritually accept this fact if you run any benchmarks. Don't rely on the outcome of a benchmark as if it's some kind of Truth. Even if you do everything right, there's something like a 1/10 risk that you're fooling yourself. This is true for any experiment, not just ones involving JavaScript, or JITs, or benchmarking.

- Benchmark large code. Language implementations (including ahead of time compilers for C!) have a lot of "winning in the average" kind of optimizations that will kick in or not based on heuristics, and those heuristics have broad visibility into large chunks of your code. AOTs get there by looking at the entire compilation unit, or sometimes even your whole program. JITs get to see a random subset of the whole program. So, if you have a small snippet of code then the performance of that snippet will vary wildly depending on how it's used. Therefore, putting some small operation in a loop and seeing how long it runs tells you almost nothing about what will happen when you use that snippet in anger as part of a larger program.

How do you benchmark large code? Build end-to-end benchmarks that measure how your whole application is doing perf-wise. This is sometimes easy (if you're writing a database you can easily benchmark TPS, and then you're running the whole DB impl and not just some small snippet of the DB). This is sometimes very hard (if you're building UX then it can be hard to measure what it means for your UX to be responsive, but it is possible). Then, if you want to know whether some function should be implemented one way or another way, run an A:B test where you benchmark your whole app with one implementation versus the other.

Why is that better? Because then, you're measuring how your snippet of code is performing in the context of how it's used, rather than in isolation. So, your measurement will account for how your choices impact the language implementation's heuristics.

Even then, you might end up fooling yourself, but it's much less likely.
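
A minimal sketch of the A:B comparison described above, in Python for brevity. Everything here is hypothetical: `ab_compare`, `run_a`, `run_b`, and the `clock` parameter are illustrative names, and each `run_*` callable stands in for driving your whole application through one end-to-end workload with one candidate implementation plugged in. The runs are interleaved so that drift (thermal throttling, background load) hits both variants roughly equally.

```python
import statistics
import time


def ab_compare(run_a, run_b, rounds=10, clock=time.perf_counter):
    """Time two whole-app workloads in alternating order.

    run_a / run_b are hypothetical callables that each execute the full
    end-to-end benchmark with one candidate implementation plugged in.
    """
    samples = {"A": [], "B": []}
    for i in range(rounds):
        # Alternate which variant goes first so neither always runs on a
        # warmer (or more throttled) machine.
        order = [("A", run_a), ("B", run_b)]
        if i % 2:
            order.reverse()
        for name, run in order:
            start = clock()
            run()
            samples[name].append(clock() - start)
    return {
        name: {
            "median": statistics.median(times),
            "min": min(times),
            "max": max(times),
        }
        for name, times in samples.items()
    }


if __name__ == "__main__":
    # Stand-in workloads; replace with calls that exercise the real app.
    result = ab_compare(lambda: sum(range(200_000)),
                        lambda: sum(list(range(200_000))))
    for name, stats in result.items():
        print(name, stats)
```

If the min/max ranges of A and B overlap heavily, it is probably safer to read the result as a tie than as a win for either side.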

7 comments

I completely agree with this advice. Micro-benchmarking can work well as long as you already have an understanding of what's happening behind the scenes. Without that it greatly increases the chance that you'll get information unrelated to how your code would perform in the real world. Even worse, I've found a lot of the performance micro-benchmarking websites can actually induce performance issues. Here's an example of a recent performance bug that appears to have been entirely driven by the website's harness. https://bugs.webkit.org/show_bug.cgi?id=283118

Love this. I have done a fair amount of UI performance optimization and agree with the end-to-end strategy.

For UX stuff, 2 steps I’d add if you're expecting a big improvement:

1) Ship some way of doing a sampled measurement in production before the optimization goes out. Networks and the spec of the client devices may be really important to the UX thing you're trying to improve. Likely user devices are different from your local benchmarking environment.

2) Try to tie it to a higher level metric (e.g. time on site, view count) that should move if the UI thing is faster. You probably don't just want it to be faster, you want the user to have an easier time doing their thing, so you want something that ties to that. At the very least this will build your intuition about your product and users.

great points! i do a lot of JS benchmarking + optimization and whole-program measurement is key. sometimes fixing one hotspot changes the whole profile, not just shifts the bottleneck to the next biggest thing in the original profile. GC behaves differently in different JS vms. sometimes if you benchmark something like CSV parsers which can stress the GC, Benchmark.js does a poor job by not letting the GC collect properly between cycles. there's a lengthy discussion about why i use a custom benchmark runner for this purpose [1]. i can recommend js-framework-benchmark [2] as a good example of one that is done well, also WebKit's speedometer [3].

[1] https://github.com/leeoniya/uDSV/issues/2

[2] https://github.com/krausest/js-framework-benchmark

[3] https://github.com/WebKit/Speedometer
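
The idea of letting the collector settle between cycles can be illustrated outside JavaScript. This is a minimal Python sketch, not how Benchmark.js or the custom runner discussed in [1] actually work. `timed_cycles`, `workload`, and `clock` are hypothetical names, and `gc.collect()` is CPython's explicit collection call, used here as a stand-in for whatever the JS VM under test offers.

```python
import gc
import time


def timed_cycles(workload, cycles=5, clock=time.perf_counter):
    """Run workload several times, collecting garbage between cycles.

    `workload` is a hypothetical zero-argument callable. Forcing a
    collection before each cycle is meant to keep garbage left over
    from one cycle from being charged to the next.
    """
    samples = []
    for _ in range(cycles):
        gc.collect()          # settle the heap before the timed region
        start = clock()
        workload()
        samples.append(clock() - start)
    return samples
```

Comparing the per-cycle samples with and without the `gc.collect()` line is one rough way to see how much collector activity is leaking across cycles.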

> there's something like a 1/10 risk that you're fooling yourself.

You’re being generous or a touch ironic. It’s at least 1/10 and probably more like 1/5 on average and 1/3 for people who don’t take advice.

Beyond testing changes in a larger test fixture, I also find that sometimes multiplying the call count for the code under examination can help clear things up. Putting a loop in to run the offending code 10 times instead of once is a clearer signal. Of course it still may end up being a false signal.

I like a two-phase approach, where you use a small-scale benchmark while iterating on optimization ideas, then check the larger context once you feel you’ve made progress, and again before you file a PR.

At the end of the day, eliminating accidental duplication of work is the most reliable form of improvement, and one that current and previous generation analysis tools don’t do well. Make your test cases deterministic and look at invocation counts to verify that n calls of a certain shape call the code in question exactly kn times, as you expect. Then figure out why it’s mn instead. (This is why I say caching is the death of perf analysis. Once it’s added, this signal disappears.)
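
A small Python sketch of the invocation-count check described above. All names (`counted`, `call_counts`, `load_price`, `render_row`, `render_table`) are hypothetical and only illustrate the idea: with deterministic input of n = 3 rows and an expectation of k = 1 lookup per row, the counter should read 3, and any other number points at duplicated work.

```python
import functools
from collections import Counter

call_counts = Counter()


def counted(fn):
    """Record how many times fn is invoked, keyed by function name."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        call_counts[fn.__name__] += 1
        return fn(*args, **kwargs)
    return wrapper


@counted
def load_price(item):
    # Hypothetical expensive lookup.
    return len(item) * 10


def render_row(item):
    # Accidental duplication: the lookup runs twice per row.
    label = f"{item}: {load_price(item)}"
    highlight = load_price(item) > 40
    return label, highlight


def render_table(items):
    return [render_row(item) for item in items]


if __name__ == "__main__":
    items = ["apple", "kiwi", "cherry"]   # n = 3, deterministic input
    render_table(items)
    expected = 1 * len(items)             # k = 1 lookup per row
    actual = call_counts["load_price"]
    print(f"expected {expected} calls, saw {actual}")
    # prints "expected 3 calls, saw 6"
```

Here m = 2 instead of k = 1, which points straight at `render_row`. Wrapping `load_price` in a cache would hide the duplicate call from the counter, which is the loss of signal the comment warns about.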

The first half of the sentence you quoted is "even if you do everything right". What is the point of selectively quoting like that and then responding to something you know they didn't mean?

With respect, the first half of the sentence sounds more like waffling than a clear warning to your peers. I bond with other engineers over how badly the industry as a whole handles the scientific method. Too many of us can’t test a thesis to the satisfaction of others. Hunches, speculation, and confidently incorrect. Every goddamn day.

Feynman said: The most important thing is not to fool yourself, and you’re the easiest person to fool.

That’s a lot more than 1/10, and he’s talking mostly to future scientists, not developers.

Excellent advice. It’s also very important to know what any micro benchmarks you do have are really measuring. I’ve seen enough that actually measured the time to set up or parse something, because those steps dominated and weren’t cached correctly. Conversely I’ve seen cases where the JIT correctly optimised away almost everything because there was a check on the final value.

Oh, and if each op takes under a nanosecond then your benchmark is almost certainly completely broken.
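
The arithmetic behind that rule of thumb, as a small Python sketch. The function names and the `floor_ns` parameter are hypothetical. Assuming a 3 GHz core, one clock cycle is about 0.33 ns, so anything under 1 ns per op would mean the whole iteration, loop overhead included, fits in roughly three cycles.

```python
def ns_per_op(elapsed_seconds, op_count):
    """Average nanoseconds spent per operation."""
    return elapsed_seconds * 1e9 / op_count


def looks_broken(elapsed_seconds, op_count, floor_ns=1.0):
    """True if the per-op time is below the plausibility floor.

    floor_ns=1.0 mirrors the rule of thumb above; it is a heuristic,
    not a hard limit.
    """
    return ns_per_op(elapsed_seconds, op_count) < floor_ns
```

For example, a run reporting 1,000,000,000 ops in 0.5 s works out to 0.5 ns per op, which this check would flag as suspicious.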

I don't really disagree with anything you said, but having to run end to end tests for any benchmarking is far from ideal. For one thing, they are often slow, and to get reliable results you have to run them multiple times, which makes it even slower. That makes it more difficult, and more expensive, to try a lot of things, iterate, and keep a short feedback loop. IME writing good end to end tests is also just generally more difficult than writing unit tests of smaller benchmarking code.

ThePrimeTime posted a livestream recording a few days ago where he and his guest dove into language comparison benchmarks. Even the first 10 minutes touches on things I hadn't thought of before beyond the obvious "these are not representative of real world workloads." It's an interesting discussion if you have the time.

https://www.youtube.com/watch?v=RrHGX1wwSYM