Hacker News
by bayindirh 1548 days ago
> Intel really wanted to release (in my opinion) very interesting processors...

No. Intel just wanted to show small improvements to keep the performance gap constant, hide its neat tricks until the competition caught up, and use them only when clients or the market really demanded it.

The only notable efforts I've seen were reducing the performance penalty of SpeedStep switching, building better memory controllers to "catch" AMD, and adding power-gating and independent throttling capabilities to address density issues in systems.

When fab/power/thermal issues became apparent, they started to hide AVX/AVX2 frequencies, created frankenprocessors for some applications, etc.

However, I've seen no real effort to make groundbreaking innovations in x86 space rather than protecting what they already had.

Performance counters and the other underlying plumbing that makes the processor observable were nice, though.

As a result, I can still use a 3rd-generation i7 as a daily driver for almost all tasks at hand, including development. The only definitive performance difference shows up when I run my scientific code on newer systems after compiling it with platform-specific optimizations. On that front, an M1 MacBook Air can be 25% faster than a 7th-gen 7700K processor, and I find that ironic.


> However, I've seen no real effort to make groundbreaking innovations in x86 space rather than protecting what they already had.

I consider what AVX-512 has to offer to be highly innovative.

Unfortunately, just when they planned to introduce AVX-512 into most desktop/laptop CPUs (not just server CPUs or special-purpose accelerators), the problems with 10 nm occurred. So this was delayed a lot, and even today many Intel desktop/laptop CPUs have no support for this feature.

Intel TSX was also really innovative, in my opinion (even though, to my knowledge, this feature was mostly used in (business) databases; what a pity).

I wouldn't call wider SIMD lanes terribly innovative, particularly when they come with power costs to evaluate and time penalties just to fill the registers with enough data from cache or memory, and when real workloads don't benefit from SIMD as much in practice because compilers are terrible at autovectorization (and humans are only marginally better at doing it manually).

AVX-512 is an example of a feature that improves special cases that show up in faux-workloads (e.g. fancy benchmarks and HPC) but does not deliver higher performance for the vast majority of workloads, including things that ostensibly should be embarrassingly parallel and reap gains from SIMD.

SIMD lanes are just a miniaturization of older vector processors as co-processors, à la Cray in a box.

As an HPC sysadmin and scientific software developer/researcher, I can confidently say that SIMD can provide real performance gains; however, there are trade-offs and decisions to be made.

- First of all, SIMD is very data-hungry. You either need to constantly push data into it, or do a lot of work on the data you've already pushed. Otherwise it just sits idle.

- Then there is the power and frequency penalty. In Intel's case, it needs a humongous amount of power in CPU budget terms, which creates heat and slowdowns. So you have to test your code both with and without SIMD (-mtune, -march, etc.). If the SIMD build is as fast or faster, use SIMD.

- Moreover, you can't just compile an extremely optimized binary and fan it out. Older processors will just throw "illegal instruction" and halt. You either provide multiple binaries with specific optimizations for each target, or a lowest-common-denominator binary per vendor (an AMD binary and an Intel binary), or just throw all the optimizations out. The best way is to give out the source with a simple makefile and let the researcher/user compile it, but not all code is open, as one may guess. Creating a universal binary with multiple code paths is also possible, yet it needs a lot of elbow grease and may not always be optimal.

- Lastly, your code doesn't have to be embarrassingly parallel to be able to use SIMD. Matrix/linear algebra libraries like Eigen can almost abuse all of the processor's units when compiled with the correct flags (-O3, -mtune=native, -march=native). However, if you want to accelerate small data with SIMD, you need to create a parallel loop that saturates the SIMD pipelines, which OpenMP can easily do with its parallel for construct.
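The parallel loop from the last bullet could look roughly like the sketch below. This is only an illustration, not the poster's code: the function name saxpy and the choice of operation are assumptions, and it uses OpenMP's combined "parallel for simd" construct (OpenMP 4.0 and later) rather than anything the thread specifies.

```c
#include <assert.h>
#include <stddef.h>

/* Computes y[i] = a * x[i] + y[i] for n elements.
 * With OpenMP enabled (e.g. -fopenmp on GCC/Clang), the pragma splits the
 * iterations across threads and asks the compiler to vectorize each chunk.
 * Without OpenMP the pragma is ignored and the loop simply runs serially. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Building it once with something like -O3 -march=native -fopenmp and once without those flags, then timing both, is one way to do the "test with and without SIMD" comparison described above.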

All of this doesn't change the fact that SIMD is a special horse that can't run on every course; however, it's not useless.
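For the "multiple code paths" option mentioned in the list above, a minimal sketch of runtime dispatch might look like this. It assumes GCC or Clang on x86-64 (the target attribute and __builtin_cpu_supports are compiler extensions), and the function names scale_scalar, scale_avx2 and scale_dispatch are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline path: built for whatever the default (lowest common) target is. */
static void scale_scalar(float *x, size_t n, float a)
{
    for (size_t i = 0; i < n; ++i) x[i] *= a;
}

#if defined(__x86_64__) && defined(__GNUC__)
/* Same loop, but the compiler is allowed to use AVX2 in this one function. */
__attribute__((target("avx2")))
static void scale_avx2(float *x, size_t n, float a)
{
    for (size_t i = 0; i < n; ++i) x[i] *= a;
}
#endif

/* Picks a code path at run time, so older CPUs never execute AVX2 code
 * and never hit "illegal instruction". */
void scale_dispatch(float *x, size_t n, float a)
{
    assert(x != NULL || n == 0);
#if defined(__x86_64__) && defined(__GNUC__)
    if (__builtin_cpu_supports("avx2")) {
        scale_avx2(x, n, a);
        return;
    }
#endif
    scale_scalar(x, n, a);
}
```

The elbow grease shows even in this toy: every hot function needs one variant per instruction set, plus a dispatcher, and each variant has to be tested on hardware that actually takes that path.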

I didn't say it was useless, just that it wasn't a magic bullet and AVX-512 isn't particularly innovative, and doesn't solve most users' problems.

I think you're missing the point of my post. I agree with all of your specific points (except one, but this isn't the forum to discuss function multiversioning (FMV) in modern compilers), but they miss the grander point that Intel hasn't made computers faster via more SIMD. The amount of expertise required to make use of it is just more evidence of that.

AVX512 was clearly a great innovation in the vectorization landscape. A far cleaner instruction set, complete and symmetric, with very interesting blend, ternlog, lane-crossing instructions and the especially interesting mask registers. Lots and lots of goodies and an eye for compiler implementation.
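To make the mask-register point concrete, here is a hedged sketch of how a mask can handle the ragged end of an array without a separate scalar remainder loop. It assumes GCC or Clang on x86-64; the names add_avx512 and add_arrays are invented for this example, and a plain scalar loop is used when the CPU lacks AVX-512F.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>

/* dst[i] = a[i] + b[i], 16 floats per step. The final partial step uses a
 * mask so only the remaining elements are loaded and stored. */
__attribute__((target("avx512f")))
static void add_avx512(const float *a, const float *b, float *dst, size_t n)
{
    for (size_t i = 0; i < n; i += 16) {
        size_t left = n - i;
        /* All 16 bits set for a full step, else only the low `left` bits. */
        __mmask16 k = left >= 16 ? (__mmask16)0xFFFF
                                 : (__mmask16)((1u << left) - 1u);
        __m512 va = _mm512_maskz_loadu_ps(k, a + i); /* masked-off lanes become 0 */
        __m512 vb = _mm512_maskz_loadu_ps(k, b + i);
        /* Masked-off lanes of dst are left untouched. */
        _mm512_mask_storeu_ps(dst + i, k, _mm512_add_ps(va, vb));
    }
}
#endif

void add_arrays(const float *a, const float *b, float *dst, size_t n)
{
    assert(a != NULL && b != NULL && dst != NULL);
#if defined(__x86_64__) && defined(__GNUC__)
    if (__builtin_cpu_supports("avx512f")) {
        add_avx512(a, b, dst, n);
        return;
    }
#endif
    for (size_t i = 0; i < n; ++i) dst[i] = a[i] + b[i]; /* scalar fallback */
}
```

With earlier x86 SIMD extensions the same tail usually needs a scalar cleanup loop or padded buffers, which is part of why the mask registers stand out.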

I feel Intel failed hard at diffusion of the ISA (why not put it everywhere, with half-perf, it'll improve later, no change in code) and also at not pushing more energy/dollars into ispc. Yeah yeah, your compiler engineers are clever, but you've been doing this for 20 years and autovectorization is still a long way off. Let me write code in a way that can be easily vectorized. A subset of C. Less awkward than CUDA.

Now it seems AVX512 and large vector units are dying and are still too niche. Sad.

The cleanup being tied to the width increase was the first problem. The new width still being a fixed one was the second.

SVE is SIMD actually done right – on the Arm side in the near future, everything from smartphones to massive HPC boxes will be covered by the same clean SIMD ISA.

I agree it would have been nice to have 'infinite sized' instructions, chopped up to the actual underlying vector size. But there were so many complaints about AMD implementing some 256-bit-wide instructions as 2x128 that I feel they went for the least-microcode route.
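The 'chopped up to the actual underlying vector size' idea can be modeled in plain C as below. This is a scalar model of the programming style only, not real SVE or AVX code: the function name vla_add and the lanes parameter (standing in for a vector length the hardware would report at run time) are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* dst[i] = a[i] + b[i], written without baking in a vector width.
 * `lanes` stands in for the hardware vector length; the same code gives
 * the same result whether it is 4, 8, 16 or anything else. */
void vla_add(const float *a, const float *b, float *dst, size_t n, size_t lanes)
{
    assert(lanes > 0);
    for (size_t i = 0; i < n; i += lanes) {
        /* Number of active lanes in this step; smaller only on the tail. */
        size_t active = (n - i < lanes) ? (n - i) : lanes;
        /* The inner loop models one predicated vector instruction. */
        for (size_t j = 0; j < active; ++j) {
            dst[i + j] = a[i + j] + b[i + j];
        }
    }
}
```

The point of the style is that a binary written this way would not need recompiling when the vector unit gets wider, which is the property a fixed 512-bit width gives up.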

Mask registers offset the size problem a bit. I just wish we'd rebuild a language or clean libraries to take full advantage of this programming model. Is ispc still maintained? Does anyone use it in prod? Genuinely curious.

I feel SVE is 'too late', as most CPU makers seem to be going back to smaller vector units (leaving the vectorized stuff to GPUs; I know they're not the same thing, but if you're investing in heavy perf hardware for repetitive computing...), and even Intel doesn't seem very serious about AVX512 except in the Xeon world. But then if you pay 8000 EUR for a Platinum thing, you might be able to pay for top talent to handcraft some intrinsics.