AVX-512 looks kind of nuts: 32 registers of 64 bytes each, so 2 KB of registers alone, and apparently enough gate area to do eight 64-bit multiplies/divides at once. It also adds masking (e.g., run this XOR instruction on only three of the eight 64-bit words) and other stuff.
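To make the masking idea concrete, here is a rough software model of it, not real AVX-512 code. It treats a 512-bit register as a list of eight 64-bit lanes and models what I understand to be merge-masking: lanes whose mask bit is clear keep their old destination value. The function name `masked_xor` and the list representation are made up for illustration.

```python
LANE_MASK = (1 << 64) - 1  # keep every lane within 64 bits

def masked_xor(dst, a, b, mask):
    """Emulate a merge-masked XOR over eight 64-bit lanes.

    Lane i becomes a[i] ^ b[i] only if bit i of `mask` is set;
    otherwise it keeps the old value from `dst`.
    """
    return [
        (x ^ y) & LANE_MASK if (mask >> i) & 1 else old
        for i, (old, x, y) in enumerate(zip(dst, a, b))
    ]

dst = [0] * 8
a = [0xFF] * 8
b = [0x0F] * 8

# Mask 0b00010101 selects lanes 0, 2 and 4: three of the eight words.
result = masked_xor(dst, a, b, 0b00010101)
```

On real hardware the mask would live in one of the dedicated mask registers and all eight lanes would be handled by a single instruction rather than a loop.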
One other interesting feature for the long term, apparently in regular Skylake too, is a bounds-checking assist (MPX): instructions and registers and address-lookup hardware to make bounds checks cheaper. (The bounds check instructions are effectively NOPs on older hardware, I think.) I don't know what the economics of supporting it are, but I like anything that might lead to more code deployed with more safety belts.
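For a sense of what the hardware would be taking over, here is a minimal software sketch of a bounds check, assuming a bounds register is just an inclusive (lower, upper) address pair. The names `make_bounds`, `check_access`, and `BoundsViolation` are hypothetical and say nothing about MPX's actual instruction encoding or ABI.

```python
class BoundsViolation(Exception):
    """Stands in for the fault a failed hardware bounds check would raise."""

def make_bounds(base, size):
    # Model a bounds register as an inclusive (lower, upper) address pair.
    return (base, base + size - 1)

def check_access(bounds, addr, access_size=1):
    """Return addr if [addr, addr + access_size) fits inside bounds."""
    lower, upper = bounds
    if addr < lower:                      # lower-bound check
        raise BoundsViolation(hex(addr))
    if addr + access_size - 1 > upper:    # upper-bound check
        raise BoundsViolation(hex(addr))
    return addr

buf = make_bounds(0x1000, 64)        # a 64-byte buffer at 0x1000
ok = check_access(buf, 0x1038, 8)    # last 8 bytes: in bounds

try:
    check_access(buf, 0x1039, 8)     # runs one byte past the end
    overflow_caught = False
except BoundsViolation:
    overflow_caught = True
```

Done in software, that is two compares and two branches on every checked access, which is the cost dedicated instructions and registers are meant to reduce.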
Finally, I wonder when Skylake server is coming out. The process delays threw off their usual tick-tock rhythm; I wonder if it means large Skylake server chips will come out with less than the usual delay after the top of the Broadwell server line (which isn't out yet), or, less likely, if Intel will skip large Broadwell Xeons entirely.
AVX-512 was the only interesting part of the microarchitecture, and honestly the only reason I would bother to spend the money on a Skylake chip. I imagine leaving it out of the regular parts will greatly reduce adoption of AVX-512. Disappointing.