by joesavage 3357 days ago
> First, compilers are really quite good. Yes, it's possible to beat them (I do) but generally not by much.

Genuine question: what would you quantify “not by much” as here? In my experience, it's not uncommon to see 2x+ speedups from well-written software pipelined hand-optimised assembly in hot loops. (Especially so for in-order processors, like you might see in embedded applications or as a LITTLE core in your smartphone.)

1 comment

2x in a hotspot is not much compared with better cache management in the rest of the program. But if that 2x is important, great. I know this sounds really boring, but you should write it in C first, find the hot loop in a profiler like VTune, and only then rewrite it. 2x of something really unimportant is still unimportant.

Also, while software pipelining is possible on something like Haswell, its limited architectural register set makes it a limited technique. Modulo variable expansion is tough with so few registers, though register renaming does help somewhat.
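To make the register-pressure point concrete, here is a rough sketch in C of what hand software pipelining with modulo variable expansion looks like on a trivial scaling loop. The function names `scale_plain` and `scale_pipelined` are made up for illustration, the buffers are assumed not to overlap, and a real compiler may well schedule the plain version just as well on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Plain loop: each iteration loads, multiplies, then stores. */
void scale_plain(float *restrict dst, const float *restrict src,
                 float k, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* Hand-pipelined sketch: the load for a later element is issued before
 * the multiply/store of the current one. The kernel is unrolled by two
 * so the two in-flight values live in separate variables (a and b)
 * instead of being copied between iterations. That renaming is the
 * modulo variable expansion step, and it is where the extra registers go.
 * Assumes dst and src do not overlap (hence restrict). */
void scale_pipelined(float *restrict dst, const float *restrict src,
                     float k, size_t n) {
    assert(n == 0 || (dst != NULL && src != NULL));
    if (n == 0)
        return;

    size_t i = 0;
    float a = src[0];              /* prologue: first load in flight */
    float b;

    for (; i + 2 < n; i += 2) {    /* kernel: two elements per trip */
        b = src[i + 1];            /* load for element i+1 */
        dst[i] = a * k;            /* finish element i */
        a = src[i + 2];            /* load for element i+2 */
        dst[i + 1] = b * k;        /* finish element i+1 */
    }

    /* epilogue: a holds src[i]; at most one more element remains */
    dst[i] = a * k;
    if (i + 1 < n)
        dst[i + 1] = src[i + 1] * k;
}
```

Every extra stage you keep in flight costs another live variable like `a` or `b`, so a deeper pipeline on a loop with real work in it runs out of named registers quickly.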

I think tools like VTune are awesome for finding hotspots and reading assembler is like reading Latin. But programming in assembler? I think it's best to disassemble and rewrite your C accordingly.

I should have mentioned this in the first post: if you're not a VTune ace, if you're not looking at the Intel PMRs and scratching your head, you probably should not be writing in assembler in 2017. Also, VTune deals with C (and Java) quite nicely (just not on OS X).

Anyways, you may get 2x+ on something, but that approach won't work with a GPU. Similarly, Apple doesn't provide microarchitectural information on the A10. Nvidia doesn't on Denver. Even though assembly will still be with us, this low-level approach is going away. Apple, Intel, Arm, Nvidia, ... really want you to write in C.

I say all this as a hardcore knuckle dragger.