| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by derf_ 303 days ago
	One time I spent a week carefully rewriting all of the SIMD asm in libtheora, really pulling out all of the stops to go after every last cycle [0], and managed to squeeze out 1% faster total decoder performance. Then I spent a day reorganizing some structs in the C code and got 7%. I think about that a lot when I decide what optimizations to go after. [0] https://gitlab.xiph.org/xiph/theora/-/blob/main/lib/x86/mmxl... is an example of what we are talking about here.

2 comments

saagarjha 303 days ago

Unfortunately modern processors do not work how most people think they do. Optimizing for less work for a nebulous idea of what "work" is generally loses to bad memory access patterns or just using better instructions that seem most expensive if you look at them superficially.

link

astrange 300 days ago

If you're important enough they'll design the next processor to run your code better anyway.

(Or at least add new features specifically for you to adopt.)

link

magicalhippo 303 days ago

It can be sobering to consider how many instructions a modern CPU can execute in case of a cache miss.

In the timespan of a L1 miss, the CPU could execute several dozen instructions assuming a L2 hit, hundreds if it needs to go to L3.

No wonder optimizing memory access can work wonders.

link