by benhoyt 1769 days ago
Wow, this update is awesome: my GoAWK interpreter (https://github.com/benhoyt/goawk) runs a simple CPU-bound AWK program 38% faster when compiled with Go 1.17 (compared to 1.16).

  $ time goawk_go1.16 'BEGIN { for (i=0; i<100000000; i++) s += i; print(s) }'
  4999999950000000
  real    0m10.158s ...
  $ time goawk_go1.17 'BEGIN { for (i=0; i<100000000; i++) s += i; print(s) }'
  4999999950000000
  real    0m6.268s ...
I wonder why it's so much better than their advertised 5% perf improvement? Here's a quick CPU profile: https://i.imgur.com/csJyOYq.png ... I don't get too much out of it at a glance, just seems like everything's a bunch faster.

Hi, I'm one of the people who worked on it, and the guy who did the initial estimate back in early 2017. 5% is the geomean of a lot of benchmarks; a whole lot fall in the 4-8% range, and a few do worse because the new ABI creates new patterns of register use that don't fit well with the current register allocator, and the fix was larger than we wanted to risk. (See https://github.com/golang/go/issues/46216 )
Overall for GoAWK I get an 18% speed increase on my micro-benchmarks between Go 1.16 and 1.17 (see https://github.com/benhoyt/goawk/commit/1f314f421273b3dc164f...) and I measured an 8% speed increase on my "slightly more real-world" benchmarks (these ones: https://github.com/benhoyt/goawk/blob/master/benchmark_awks....).
The benefits come primarily from avoiding extra work spilling arguments to/from the stack on function calls. If you are making lots and lots of function calls, particularly to small functions that can't be inlined, there could certainly be much bigger improvements.
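To make the "small functions that can't be inlined" case concrete, here is a minimal sketch (not taken from GoAWK or the Go benchmark suite; `add`, `sumCalls` and `sumInline` are made-up names). It forces a real call on every loop iteration with `//go:noinline` and times it against the same loop with no calls. Building it with two Go versions and comparing the first timing is one rough way to see the calling-convention effect in isolation; the numbers will vary by machine.

```go
package main

import (
	"fmt"
	"time"
)

// add is a hypothetical tiny helper. The directive below keeps the compiler
// from inlining it, so every iteration of sumCalls pays for a real call.
//
//go:noinline
func add(a, b int) int {
	return a + b
}

// sumCalls adds 0..n-1 using one non-inlined call per iteration.
func sumCalls(n int) int {
	s := 0
	for i := 0; i < n; i++ {
		s = add(s, i)
	}
	return s
}

// sumInline computes the same sum with no function calls in the loop body.
func sumInline(n int) int {
	s := 0
	for i := 0; i < n; i++ {
		s += i
	}
	return s
}

func main() {
	const n = 100_000_000

	start := time.Now()
	a := sumCalls(n)
	callTime := time.Since(start)

	start = time.Now()
	b := sumInline(n)
	inlineTime := time.Since(start)

	fmt.Println("with calls:   ", a, callTime)
	fmt.Println("without calls:", b, inlineTime)
}
```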
Just an FYI: you can use pprof's -diff_base flag to diff the two profiles instead of opening them side by side.
Oh, good to know, thanks!
The speedup depends a lot on the structure of the code being benchmarked. Typical hand-written Go code does most of its computation in local loops without many function calls, so the optimization has less effect there. An interpreter, on the other hand, often calls a function for every instruction it executes, which means lots of function calls inside loops, sometimes one for every single operation. That kind of code of course profits massively from this optimization.
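As an illustration of that shape (a toy sketch only, not how GoAWK is actually structured; every type and function name here is invented), consider a tiny stack machine whose main loop calls `step` once per instruction, and `step` in turn calls `push` and `pop`. Whether the compiler inlines any of these helpers is up to it, but in a real interpreter the per-instruction work is usually too large to inline, so the call overhead is paid on every operation.

```go
package main

import "fmt"

type opcode int

const (
	opPush opcode = iota // push arg onto the stack
	opAdd                // pop two values, push their sum
	opHalt               // stop execution
)

// instr is one instruction of the hypothetical toy machine.
type instr struct {
	op  opcode
	arg int
}

type vm struct {
	stack []int
}

func (m *vm) push(v int) {
	m.stack = append(m.stack, v)
}

func (m *vm) pop() int {
	v := m.stack[len(m.stack)-1]
	m.stack = m.stack[:len(m.stack)-1]
	return v
}

// step executes a single instruction and reports whether to keep going.
func (m *vm) step(in instr) bool {
	switch in.op {
	case opPush:
		m.push(in.arg)
	case opAdd:
		b, a := m.pop(), m.pop()
		m.push(a + b)
	case opHalt:
		return false
	}
	return true
}

// run executes prog and returns the value left on top of the stack.
// It makes one call to step per instruction, plus the push/pop calls inside.
func run(prog []instr) int {
	m := &vm{}
	for _, in := range prog {
		if !m.step(in) {
			break
		}
	}
	return m.pop()
}

func main() {
	prog := []instr{{opPush, 2}, {opPush, 3}, {opAdd, 0}, {opHalt, 0}}
	fmt.Println(run(prog))
}
```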
Look at the disassembly and observe how your function calls have far fewer push/pop operations going on, and how the function prologues/epilogues are smaller.