

by aseipp 919 days ago

Some observations:

- Very bad performance on existing x86 workloads, so a major selling point was basically not there in practice, because extracting any meaningful performance required a software rewrite anyway. This was an important adoption criterion; if they had outright said "All your existing workloads are compatible, but will perform like complete dogshit", why would anyone bother? Compatibility was a big selling point that ended up meaning little in practice, unfortunately.

- Not actually what x86 users wanted. This was at the height of "Intel stagnation" and while I think they were experimenting with lots of stuff, well, in this case, they were serving a market that didn't really want what they had (or at least wasn't convinced they wanted it).

- GPU creators weren't sitting idle and twiddling their thumbs. Nvidia was continuously improving performance and programmability of their GPUs across all segments (gaming, HPC, datacenters, scientific workloads) while this was all happening. They improved their compilers, programming models, and microarchitecture. They did not sit by on any of these fronts.

Ironically the main living legacy of Phi is AVX-512, which people did and still do want. But that kind of gives it all away, doesn't it? People didn't want a new massively multicore microarchitecture. They wanted new vector instructions that were flexible and easier to program than what they had -- and AVX-512 is really much better. They wanted the things they were already doing to get better, not things that were like, effectively a different market.

Anyway, the most important point is probably the last one, honestly. Like we could talk a lot about compiler optimizations or autovectorization. But really, the market that Phi was trying to occupy just wasn't actually that big, and in the end, GPUs got better at things they were bad at, quicker than Phi got better at things it was bad at. It's not dissimilar to Optane. Technically interesting, and I mourn its death, but the competition simply improved faster than the adoption rate of the new thing, and so flash is what we have.

Once you factor in that you have to rewrite software to get meaningful performance uplift, the rest sort of falls into place. Keep in mind that if you have a $10,000 chip and you can only extract 50% of the performance, you have more or less just set $5,000 on fire for nothing in return. You might as well go all the way and use a GPU because at least then you're getting more ops/mm^2 of silicon.
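To make that arithmetic concrete, here is a rough sketch. The function names are made up for illustration, and "utilization" here just means the fraction of peak performance your software actually extracts; the $10,000 and 50% figures are the ones from the paragraph above.

```python
def wasted_spend(chip_price: float, utilization: float) -> float:
    """Dollars spent on peak performance that the software never extracts."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return chip_price * (1.0 - utilization)


def effective_price(chip_price: float, utilization: float) -> float:
    """What the chip would have to cost at 100% utilization for the same value."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return chip_price / utilization


# The example from the comment: a $10,000 chip at 50% of peak.
print(wasted_spend(10_000, 0.5))     # 5000.0
print(effective_price(10_000, 0.5))  # 20000.0
```

Put the other way around: at 50% utilization you are effectively paying $20,000 for a chip's worth of delivered performance, which is why the rewrite cost dominates the decision.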

1 comments

jauntywundrkind 919 days ago

I don't disagree anywhere, but I don't think any of these statements actually condemn Xeon Phi outright. It didn't work at the time, and doing it with so little software support to tile out workloads well was a big & possibly bad gambit, but I'm not sure we can condemn the architecture. There seem to be so few folks who made good attempts, succeeded or failed, & wrote about it.

I tend to think there was tons of untapped potential still on the table. And that a failure to realize that potential isn't purely Intel's fault. The story we are commenting on is about the rest of the industry trying to figure out enduring joint strategies, and much of this is chipmaker provided, but it is also informed and helped by plenty of consumers also pouring energy in to figure out what's working and not, trying to push the bounds.

Agreed that anyone going in thinking Xeon Phi would be viable for running a boring everyday x86 workload was going to be sad. To me the promise seemed clear that existing toolchains & code would work, but it was always clear to me there were a bunch of little punycores & massive SIMD units and that doing anything not SIMD intensive wasn't going to go well at all. But what's the current trend? Intel and AMD are both actively building not punycores but smaller cores, with Sierra Forest and Bergamo. E-cores are the grown up Atom we saw here.

Yes the GPGPU folks were winning. They had a huge head start, were the default option. And Intel was having trouble delivering nodes. So yes, Xeon Phi was getting trounced for real reasons. But they weren't architectural issues! It just means the Xeon Phi premise was becoming increasingly handicapped.

As I said, I broadly agree everywhere. Your core point about giving the market more of what it already does is well taken; it's a river of wisdom we see again and again. But I do think conservative thinking, iterating along, is dangerous thinking that obstructs us from seeing real value & possibility before us. Maybe Intel could have made a better ML chip than the GPGPU market has gotten for years, had things gone differently; I think the industry could perhaps have been glad they had veered onto a new course, but the barriers to that happening & the slowdown in Intel delivery & the difficulty bootstrapping new software were all horrible encumbrances which were rightly more than was worth bearing together.


vkazanov 919 days ago

I don't think anybody seriously considered Phis for generic compute or something.

Most experimenters saw it as a way to have something GPU-like in terms of raw power but without the limitations characteristic of SIMT. Like, slightly different code paths for threads doing number crunching or something.

But it turns out that it's easier to force everything into a matrix. Or a very big matrix. Or a very-very-very big matrix.

And then see what sticks.


janwas 918 days ago

Why are we not also talking about memory bandwidth? Personal opinion: this is the key. The latest Phi had about 100 GB/s in 2017. The contemporary Nvidia GTX 1080: 320 GB/s.

When CPUs actually come with bandwidth and a decent vector unit, such as the A64FX, lo and behold, they lead the Top500 supercomputer list, also beating out GPUs of the day.

Why have we not been getting bandwidth in CPUs? Is it because SPECint benchmarks do not use much? Or because there is too much branch-heavy code, so we think hundreds of cores are helpful?

Existing machines are ridiculously imbalanced, hundreds of times more compute vs bandwidth than the 1:1 still seen in the 90s. Hence matmul as a way of using/wasting the extra compute.
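A rough roofline-style sketch of that imbalance, in case it helps. The 100 GB/s and 320 GB/s figures are the ones quoted above; the 3000 GFLOP/s peak is a made-up placeholder, not a measured number for any real chip, and the matmul formula assumes each matrix is read or written exactly once (ideal caching).

```python
def attainable_gflops(peak_gflops: float, bandwidth_gbs: float,
                      flops_per_byte: float) -> float:
    """Roofline bound: limited by either peak compute or memory traffic."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


def triad_intensity(bytes_per_element: int = 8) -> float:
    """a[i] = b[i] + s * c[i]: 2 flops per 3 elements moved."""
    return 2 / (3 * bytes_per_element)


def matmul_intensity(n: int, bytes_per_element: int = 8) -> float:
    """n x n matmul: 2*n^3 flops over 3*n^2 elements (idealized traffic)."""
    return (2 * n**3) / (3 * n**2 * bytes_per_element)


PEAK = 3000.0  # GFLOP/s, hypothetical placeholder for illustration only

for name, bw in [("100 GB/s", 100.0), ("320 GB/s", 320.0)]:
    streaming = attainable_gflops(PEAK, bw, triad_intensity())
    dense = attainable_gflops(PEAK, bw, matmul_intensity(1200))
    print(f"{name}: triad {streaming:.1f} GFLOP/s, matmul {dense:.1f} GFLOP/s")
```

Under these assumptions a streaming kernel only ever sees a few GFLOP/s no matter how wide the vector units are, while a large matmul has enough flops per byte to hit the compute ceiling. That is the sense in which matmul is one of the few ways to use the extra compute.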

The AMD MI300A looks like a very interesting development: >5 TB/s shared by 24 cores plus GPUs.
