by CalChris
3357 days ago
A 2x speedup in a hotspot is not much compared with better cache management in the rest of the program. But if that 2x is important, great. I know this sounds really boring, but you should write it in C first, find the hot loop in a profiler like VTune, and then rewrite. 2x of something really unimportant is still unimportant.

Also, while software pipelining is possible on something like Haswell, its limited register set makes it a limited technique. Modulo variable expansion is tough with so few registers, though register renaming does help somewhat.

I think tools like VTune are awesome for finding hotspots, and reading assembler is like reading Latin. But programming in assembler? I think it's best to disassemble and rewrite your C accordingly.

I should have mentioned this in the first post: if you're not a VTune ace, if you're not looking at the Intel PMRs and scratching your head, you probably should not be writing in assembler in 2017. Also, VTune deals with C (and Java) quite nicely (just not on OS X).

Anyways, you may get 2x+ on something, but that approach won't work with a GPU. Similarly, Apple doesn't provide microarchitectural information on the A10. Nvidia doesn't on Denver. Even though assembly will still be with us, this low-level approach is going away. Apple, Intel, Arm, Nvidia, ... really want you to write in C. I say all this as a hardcore knuckle dragger.
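
To make "disassemble and rewrite your C accordingly" a bit more concrete, here is a minimal sketch in plain C. It is not a full software-pipelined loop with modulo variable expansion, only the same underlying idea of spending extra registers to break a dependency chain. The function names `dot_simple` and `dot_expanded` and the unroll factor of 4 are assumptions picked for illustration, not something measured on Haswell or any other core.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: a single accumulator, so each add depends on the previous one. */
float dot_simple(const float *a, const float *b, size_t n)
{
    assert((a != NULL && b != NULL) || n == 0);
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Hand-expanded: four independent accumulators (hypothetical unroll factor).
 * Each accumulator needs its own register, which is where a small
 * architectural register file starts to hurt. */
float dot_expanded(const float *a, const float *b, size_t n)
{
    assert((a != NULL && b != NULL) || n == 0);
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: four independent dependency chains per iteration. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    /* Combine the partial sums, then handle the leftover 0..3 elements. */
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

Note that the expanded version adds the floats in a different order, so results can differ in the last bits from the simple loop. That is also why a compiler will not make this change to floating-point code on its own without a relaxed-math option. Whether it is actually faster is a question for the profiler, not for this sketch.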