Hacker News new | ask | show | jobs
by pjmlp 917 days ago
It was called Larrabee and XeonPhi, they botched it, and the only thing left from that effort is AVX.
1 comments

I used to play with these toys 7-8 years ago. We tried everything, and it was bad at it all.

Traditional compute? The cores were too weak.

Number crunching? Okay-ish but gpus were better.

Useless stuff.

They seemed exceedingly hard to use well but interestingly capable & full of promise. And they were made in a much more primitive software age.

I'd love to hear about what didn't work. OpenMP support seemed ok maybe but OpenMP is just a platform, figuring out software architectures that's mechanistically sympathetic to the system is hard. It would be so interesting to see what Xeon Phi might have been if we had Calcite or Velox or OpenXLA or other execution engine/optimizers that can orchestrate usage. The possibility of something like Phi seems so much higher now.

There's such a consensus around Phi tanking, and yes, some people came and tried and failed. But most of those lessons, of why it wasn't working (or was!) never survived the era, never were turned into stories & research that illuminates what Phi really was. My feeling is that most people were staying the course on GPU stuff, and that there weren't that many people trying Phi. I'd like more than the heresay heaped at Phi's feed to judge by.

Well... Back then in my shop they would just assign programmers to things, together with a couple of mathematicians.

Math guys came up with a list of algorithms to try for a search engine backend.

What we needed was matrix multiplication and maybe some decision tree walking (that was some time ago, trees were still big back then, NNs were seen as too compute-intensive for no clear benefits). So we thought that it might be cool to have a tool that would support both. Phi sounded just right for both.

And things written to AVX-512 did work. Software surpisingly easy to port.

But then comes the usual SIMD/CPU trouble: every SIMD generation wants a little software rewrite. So for both Phi generations we had to update our code. For things not compatible with the SIMD approach (think tree-walking) it is just a weak x86.

In theory Phi's were universal, in practice what we got was: okay number crunching, bad generic compute.

GPU was somewhat similar: the software stack was unstable, CUDA just did not materialize as a standard yet. But every generation introduced a massive increase in compute available. And boy did NVIDIA move fast...

So GPU situation was: amazing number crunching, no generic compute.

And then there were a few ML breakthroughs results which rendered everything that did not look like a matrix multiplication obsolete.

PS I wouldn't take this story too seriously, details may vary.

By any chance, Yandex?
Nope but close enough :-)
Some observations:

- Very bad performance at existing x86 workloads, so a major selling point was basically not there in practice, because extracting any meaningful performance required a software rewrite anyway. This was an important adoption criteria; if they outright said "All your existing workloads are compatible, but will perform like complete dogshit", why would anyone bother? Compatibility was a big selling point that ended up meaning little in practice, unfortunately.

- Not actually what x86 users wanted. This was at the height of "Intel stagnation" and while I think they were experimenting with lots of stuff, well, in this case, they were serving a market that didn't really want what they had (or at least wasn't convinced they wanted it).

- GPU creators weren't sitting idle and twiddling their thumbs. Nvidia was continuously improving performance and programmability of their GPUs across all segments (gaming, HPC, datacenters, scientific workloads) while this was all happening. They improved their compilers, programming models, and microarchitecture. They did not sit by on any of these fronts.

Ironically the main living legacy of Phi is AVX-512, which people did and still do want. But that kind of gives it all away, doesn't it? People didn't want a new massively multicore microarchitecture. They wanted new vector instructions that were flexible and easier to program than what they had -- and AVX-512 is really much better. They wanted the things they were already doing to get better, not things that were like, effectively a different market.

Anyway, the most important point is probably the last one, honestly. Like we could talk a lot about compiler optimizations or autovectorization. But really, the market that Phi was trying to occupy just wasn't actually that big, and in the end, GPUs got better at things they were bad at, quicker than Phi got better at things it was bad at. It's not dissimilar to Optane. Technically interesting, and I mourn its death, but the competition simply improved faster than the adoption rate of the new thing, and so flash is what we have.

Once you factor in that you have to rewrite software to get meaningful performance uplift, the rest sort of falls into place. Keep in mind that if you have a $10,000 chip and you can only extract 50% of the performance, you more or less have just $5,000 on fire for nothing in return. You might as well go all the way and use a GPU because at least then you're getting more ops/mm^2 of silicon.

I don't disagree anywhere but I don't think any of these statements actually condemn Xeon Phi outright. It didn't work at the time, and doing it with so little software support to tile out workloads well was a big & possibly bad gambit, but I'm so unsure we can condemn the architecture. There seems to be so few folks who made good attempts and succeeded or failed & wrote about it.

I tend to think there was tons of untapped potential still on the table. And that a failure to adopt potential isn't purely Intel alone's fault. The story we are commenting on is about the rest-of-industry trying to figure out enduring joint strategies, and much of this is chipmaker provided, but it is also informed and helped by plenty of consumers also pouring energy in to figure out what's working and not, trying to push the bounds.

Agreed that anyone going in thinking Xeon Phi would be viable for running a boring everyday x86 workload was going to be sad. To me the promise seemed clear that existing toolchains & code would work, but it was always clear to me there were a bunch of little punycores & massive SIMD units and that doing anything not SIMD intensive wasn't going to go well at all. But what's the current trend? Intel and AMD are both actively building not punycores but smaller cores, with Sierra Forest and Bergamo. E-cores are the grown up Atom we saw here.

Yes the GPGPU folks were winning. They had a huge head start, were the default option. And Intel was having trouble delivering nodes. So yes, Xeon Phi was getting trounced for real reasons. But they weren't architectural issues! It just means the Xeon Phi premise was becoming increasingly handicapped.

As I said I broadly agree everywhere. Your core point about giving the market more of what it already does is well taken, is a river of wisdom we see again and again. But I do think conservative thinking, iterating along, is dangerous thinking that obstructs us from seeing real value & possibility before us. Maybe Intel could have made a better ML chip than the GPGPU market has gotten for years, had things gone differently; I think the industry could perhaps have been glad they had veered onto a new course, but the barriers to that happening & the slow down in Intel delivery & the difficulty bootstrapping new software were all horrible encumberances which were rightly more than was worth bearing together.

I don't thing anybody seriously considered Phi's for generic compute or something.

Most experimenters saw it as a way to have something GPU-like in terms of raw power but with no limitations charateristic of SIMT's. Like, slightly different code paths for threads doing number crunching or something.

But it turns out that it's easier to force everything into a matrix. Or a very big matrix. Or a very-very-very big matrix.

And then see what sticks.

Why are we not also talking about memory bandwidth? Personal opinion: this is the key. The latest Phi had about 100 GB/s in 2017. The contemporary Nvidia GTX 1080: 320 GB/s.

When CPUs actually come with bandwidth and a decent vector unit, such as the A64FX, lo and behold, they lead the Top500 supercomputer list, also beating out GPUs of the day.

Why have we not been getting bandwidth in CPUs? Is it because SPECint benchmarks do not use much? Or because there is too much branch-heavy code, so we think hundreds of cores are helpful?

Existing machines are ridiculously imbalanced, hundreds of times more compute vs bandwidth than the 1:1 still seen in the 90s. Hence matmul as a way of using/wasting the extra compute.

The AMD MI300a looks like a very interesting development: >5 TB/s shared by 24 cores plus GPUs.

Hence why " they botched it".