by rcgorton
1333 days ago
Re: register windows. I disagree: code size wasn't the killer here; it was how DEEP the stack got. If your architectural register windows spilled at 4 deep, then calls 3 deep were fine, but if you had code iterating over a tight loop that went 8 calls deep, you were in [performance] trouble.

Another divot: asymmetric functional units. Some versions of Alpha supported a PopCount instruction, but it only executed in a single functional unit, which made scheduling a pain, especially if you had to write in assembly language.

I'm not convinced that AVX 256 and AVX 512 are useful for non-matrix operations. Most strings (more importantly, tokens bounded by whitespace when parsing) are much shorter than 512 bits (64 bytes). In English, I cannot come up with many words longer than 16 bytes (some place names, antidisestablishmentarianism, chemical compound names, and some other stuff).
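
To put rough numbers on the register-window point, here is a minimal sketch in C of a toy model, not a description of any real CPU. It assumes the loop's own frame occupies one window, that a call with no free window spills exactly one window, and that a return to a spilled frame fills exactly one. The names `window_traffic` and `simulate_call_loop` are made up for illustration.

```c
#include <assert.h>

/* Spill/fill counts for a toy register-window machine (hypothetical model). */
typedef struct {
    unsigned long spills;
    unsigned long fills;
} window_traffic;

/* Run `iterations` passes of a loop whose body descends `depth` calls and
 * then returns, on a machine with `windows` register windows. */
window_traffic simulate_call_loop(unsigned windows, unsigned depth,
                                  unsigned iterations)
{
    assert(windows >= 1);
    window_traffic t = {0, 0};
    unsigned resident = 1; /* assumption: the loop's frame holds one window */

    for (unsigned it = 0; it < iterations; it++) {
        for (unsigned d = 0; d < depth; d++) {      /* calls */
            if (resident == windows)
                t.spills++;   /* no free window: oldest one goes to memory */
            else
                resident++;
        }
        for (unsigned d = 0; d < depth; d++) {      /* returns */
            if (resident == 1)
                t.fills++;    /* caller's window was spilled: reload it */
            else
                resident--;
        }
    }
    return t;
}
```

Under these assumptions, 4 windows and calls 3 deep produce no memory traffic at all, while calls 8 deep cost 5 spills and 5 fills on every pass through the loop, which is the cliff described above.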
|
I've observed that, compared to regular x86-64 code without SIMD, using AVX 256 speeds up the ChaCha20 cipher by a factor of 5 for long messages, which can be processed in 512-byte chunks (8 blocks). Network packets easily exceed 1 KB, and files are usually much bigger.
Matrix operations aren't the only viable niche.
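
For anyone wondering where the "8 blocks" comes from: a ChaCha20 block is sixteen 32-bit words (64 bytes), and a 256-bit register holds eight 32-bit lanes, so the same word from 8 independent blocks fits in one register. Below is a minimal sketch in portable C of the quarter round applied across 8 lanes. It is not the code I measured; the names `LANES`, `rotl32`, and `quarter_round_lanes` are illustrative, and a real AVX implementation would use intrinsics or rely on the compiler vectorizing the loop.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 8 /* eight 32-bit words fill one 256-bit register */

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* ChaCha quarter round on LANES independent blocks at once.
 * a[i], b[i], c[i], d[i] are the four words belonging to block i.
 * Every lane does the same add/xor/rotate sequence with no cross-lane
 * dependency, which is what makes the cipher SIMD-friendly. */
void quarter_round_lanes(uint32_t a[LANES], uint32_t b[LANES],
                         uint32_t c[LANES], uint32_t d[LANES])
{
    for (size_t i = 0; i < LANES; i++) {
        a[i] += b[i]; d[i] ^= a[i]; d[i] = rotl32(d[i], 16);
        c[i] += d[i]; b[i] ^= c[i]; b[i] = rotl32(b[i], 12);
        a[i] += b[i]; d[i] ^= a[i]; d[i] = rotl32(d[i], 8);
        c[i] += d[i]; b[i] ^= c[i]; b[i] = rotl32(b[i], 7);
    }
}
```

The operations are only 32-bit adds, XORs, and rotates, so nothing here is matrix math; the width is used purely to run independent blocks side by side.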