by frozenport
3274 days ago
This is stupid because it categorically denies the business case for micro-optimization. Yet a business case might exist when the library is heavily used or, often, when a compiler isn't able to produce the correct code. There are also cases, primarily in finance, where single-threaded low latency distinguishes competing groups. Some of those guys count every nanosecond. The techniques described here (and in other places) are universally applicable.
Try as I might, I could not beat GCC [2], which used non-vectorized code. I chalk it up to not knowing how best to write optimized x86 code anymore (it's been years since I did any real assembly language programming), and I might be hitting some scheduling or pipeline issues. I just don't know.
[1] I described the code years ago here: http://boston.conman.org/2004/06/09.2
[2] I beat clang easily though.