| (I designed JavaScriptCore's optimizing JITs and its garbage collector and a bunch of the runtime. And I often benchmark stuff.) Here's my advice for how to run benchmarks and be happy with the results. - Any experiment you perform has the risk of producing an outcome that misleads you. You have to viscerally and spiritually accept this fact if you run any benchmarks. Don't rely on the outcome of a benchmark as if it's some kind of Truth. Even if you do everything right, there's something like a 1/10 risk that you're fooling yourself. This is true for any experiment, not just ones involving JavaScript, or JITs, or benchmarking. - Benchmark large code. Language implementations (including ahead of time compilers for C!) have a lot of "winning in the average" kind of optimizations that will kick in or not based on heuristics, and those heuristics have broad visibility into large chunks of your code. AOTs get there by looking at the entire compilation unit, or sometimes even your whole program. JITs get to see a random subset of the whole program. So, if you have a small snippet of code then the performance of that snippet will vary wildly depending on how it's used. Therefore, putting some small operation in a loop and seeing how long it runs tells you almost nothing about what will happen when you use that snippet in anger as part of a larger program. How do you benchmark large code? Build end-to-end benchmarks that measure how your whole application is doing perf-wise. This is sometimes easy (if you're writing a database you can easily benchmark TPS, and then you're running the whole DB impl and not just some small snippet of the DB). This is sometimes very hard (if you're building UX then it can be hard to measure what it means for your UX to be responsive, but it is possible). Then, if you want to know whether some function should be implemented one way or another way, run an A:B test where you benchmark your whole app with one implementation versus the other. Why is that better? Because then, you're measuring how your snippet of code is performing in the context of how it's used, rather than in isolation. So, your measurement will account for how your choices impact the language implementation's heuristics. Even then, you might end up fooling yourself, but it's much less likely. |