Hacker News
by andrewf 1423 days ago
https://sqlite.org/cpu.html#microopt

Cachegrind is used to measure performance because it gives answers that are repeatable to 7 or more significant digits. In comparison, actual (wall-clock) run times are scarcely repeatable beyond one significant digit [...] The high repeatability of cachegrind allows the SQLite developers to implement and measure "microoptimizations".

There are many ways for caches to behave differently, but have they changed much over the past 20 years? That is, is the difference between [2022 AMD cache, 2002 AMD cache] significantly greater than the differences among [2002 PowerPC G4 cache, 2002 AMD cache, 2002 Intel cache]?


I would guess yes, just based on how L1/L2 (and later L3) caches are used and sized across those systems. For AMD, 2002 vs 2022 is roughly K8 vs 5800X3D, so you're comparing 1 core with 64+64KB of L1 cache and 512KB of L2 cache[1] against 8 cores (16 threads with SMT), each with 32+32KB of L1 and 512KB of L2, plus 96MB of shared L3.
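Tallying those figures (a quick sketch that simply reuses the numbers quoted above, counting L1 as instruction plus data and ignoring the Athlon 64 models with 1MB of L2) shows total on-die cache growing by roughly 160x, even though L1 per core actually shrank:

```python
# Cache sizes in KB, as quoted in the comment above. L1 is instruction + data.
K8 = {"cores": 1, "l1_per_core": 64 + 64, "l2_per_core": 512, "l3_shared": 0}
ZEN3_X3D = {"cores": 8, "l1_per_core": 32 + 32, "l2_per_core": 512,
            "l3_shared": 96 * 1024}


def total_kb(cpu):
    """Total on-die cache in KB: per-core L1 and L2 times core count, plus shared L3."""
    return cpu["cores"] * (cpu["l1_per_core"] + cpu["l2_per_core"]) + cpu["l3_shared"]


k8_total = total_kb(K8)          # 640 KB
x3d_total = total_kb(ZEN3_X3D)   # 102912 KB (100.5 MB)
print(f"K8 (Athlon 64): {k8_total} KB")
print(f"5800X3D: {x3d_total} KB")
print(f"ratio: {x3d_total / k8_total:.1f}x")
print(f"L1 per core: {K8['l1_per_core']} KB -> {ZEN3_X3D['l1_per_core']} KB")
```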

I think managing cache traffic between L2 and L3 alone would be an additional consideration, and then you have to account for the actual architectural differences; on server chips, locality will matter quite a bit.

[1]: https://en.wikipedia.org/wiki/Athlon_64

I don't know how sophisticated the streaming/prefetch/access-pattern prediction in 2002 CPUs was.

I'm speculating, but if that isn't modeled, cachegrind may pessimize some less simple but still predictable access patterns, reporting a lot of misses where the CPU would actually have prefetched the data.
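A toy model of that concern, assuming a fully associative LRU cache of 16 lines, 64-byte lines, and a naive next-line prefetcher (all simplifications chosen for illustration; this is not Cachegrind's simulator or any real CPU's prefetcher). A plain sequential scan costs one miss per line without prefetching, but only a single miss in total when every access also pulls in the following line:

```python
from collections import OrderedDict

LINE = 64  # assumed cache-line size in bytes


class LRUCache:
    """Fully associative LRU cache that counts demand misses only."""

    def __init__(self, capacity_lines, prefetch_next_line=False):
        self.capacity = capacity_lines
        self.prefetch = prefetch_next_line
        self.lines = OrderedDict()  # line number -> None, oldest first
        self.misses = 0

    def _install(self, line):
        self.lines[line] = None
        self.lines.move_to_end(line)
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict least recently used line

    def access(self, addr):
        line = addr // LINE
        if line in self.lines:
            self.lines.move_to_end(line)  # hit: mark as most recently used
        else:
            self.misses += 1
            self._install(line)
        # Naive next-line prefetcher: fetch line + 1 on every access.
        # Prefetched lines are installed but never counted as misses.
        if self.prefetch and line + 1 not in self.lines:
            self._install(line + 1)


def count_misses(trace, prefetch):
    cache = LRUCache(capacity_lines=16, prefetch_next_line=prefetch)
    for addr in trace:
        cache.access(addr)
    return cache.misses


sequential = range(0, 4096, 8)  # 512 reads, 8 bytes apart, spanning 64 lines
print(count_misses(sequential, prefetch=False))  # 64
print(count_misses(sequential, prefetch=True))   # 1
```

Real prefetchers also track strides and multiple streams, so the gap between a no-prefetch model and hardware can be larger or smaller than this for other patterns.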

Agreed. I suspect it would be most accurate to say the SQLite folks are minimizing their working set.
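A small illustration of why working-set size matters so much to a simulated cache, assuming a fully associative LRU cache holding a hypothetical 8 lines. A loop touching 8 lines misses only on its first pass, while one touching 9 lines misses on every access, because LRU always evicts exactly the line the loop needs next:

```python
from collections import OrderedDict


def lru_misses(trace, capacity):
    """Count misses for a fully associative LRU cache holding `capacity` lines."""
    cache = OrderedDict()
    misses = 0
    for line in trace:
        if line in cache:
            cache.move_to_end(line)  # hit: refresh recency
        else:
            misses += 1
            cache[line] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used line
    return misses


def loop_trace(working_set, passes):
    """Touch lines 0..working_set-1 in order, `passes` times (like a hot loop)."""
    return [line for _ in range(passes) for line in range(working_set)]


CAPACITY = 8  # assumed toy cache size, in lines
for ws in (8, 9):
    print(f"working set {ws} lines: {lru_misses(loop_trace(ws, 10), CAPACITY)} misses")
# working set 8 lines: 8 misses   (only the first pass misses)
# working set 9 lines: 90 misses  (every access misses)
```

Real caches are set-associative and rarely pure LRU, so the cliff is usually less sharp, but the direction is the same: shaving a line or two off the working set can move a hot loop from thrashing to fitting.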

I picked a couple of random performance commits out of their code repo, and they look like they might keep one or two cache lines out of the i-cache: https://sqlite.org/src/info/f48bd8f85d86fd93 https://sqlite.org/src/info/390717e68800af9b
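For a sense of what "one or two lines" means, the number of cache lines a block of code occupies depends on both its size and its alignment. This sketch assumes 64-byte lines; the addresses and sizes are hypothetical and not taken from the linked commits:

```python
LINE = 64  # assumed cache-line size in bytes (typical on x86-64)


def lines_spanned(start, size, line=LINE):
    """Number of cache lines touched by the byte range [start, start + size)."""
    if size <= 0:
        return 0
    first = start // line
    last = (start + size - 1) // line
    return last - first + 1


# Hypothetical function placed at a 64-byte-aligned address.
print(lines_spanned(0x1000, 200))      # 4 lines
print(lines_spanned(0x1000, 192))      # 3 lines: 8 bytes shorter, one line saved
# A tiny 10-byte snippet that straddles a line boundary still costs 2 lines.
print(lines_spanned(0x1000 + 60, 10))  # 2 lines
```

Trimming a hot function from 200 to 192 bytes at an aligned address drops it from 4 lines to 3, which is the kind of difference a cache simulator counts exactly but wall-clock timing would struggle to show.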