by goosehonk 2384 days ago
This comment seems kinda slanted. AVX-512 debuted on Xeon because datacenter operators asked for it. It does not “downclock a whole chip”, it gates the core where it is active and there’s not even that penalty on the current generation parts. “10nm” is marketing fluff which has little or nothing to do with actual semiconductor construction. “Chiplet” is also marketing-speak for “wow this memory topology is hard to program around”. Not sure they should feel too bad about missing that boat.

What Intel really should be worried about is the client side being their largest revenue segment. That’s a dead business, eventually. And the bets they made didn’t pan out: FPGAs aren’t popular because the people sophisticated enough to use them are also smart enough to tape out ASICs. IoT is not a thing.


Uhh source on AVX512 not downclocking on modern CPUs? We benchmarked ML workloads on the newest chips the cloud had to offer and the slowdown was a significant problem because, as the parent comment said, it is very hard to reason about whether the benefits of vectorized ops will outweigh the reduced clock speed. Sometimes they do and sometimes they do not - which is a major problem when you have to specify the instruction set when you build the ML library from source.

Maybe you know something I don’t but that FPGA statement makes zero sense to me. The ASIC development cycle is measured in years - that’s why FPGAs are valuable (and I thought they were relatively heavily used).

It is a major problem to figure out what instructions to use but it's a lot more nuanced than you seem to imply. In the first place you seem to assume that "not running at the max turbo speed that's printed in the marketing literature" is equivalent to "downclocking". However, there are a huge number of reasons why a core might not clock up, including the number of active cores on the package. The first Xeons that shipped with AVX-512 had turbo clocks that were 25% lower than the headline turbo clocks, e.g. 2400MHz instead of 3200MHz. This is still pretty good, and the base clock is 2100MHz.

With the newest Ice Lake processors ("10th generation") the all-cores-active, all-avx-512 max clock speeds are the same as max scalar clock speeds. You can try this out yourself with the avx-turbo program.

No, when we used MKL, the workload was slower and turning off MKL made the workload faster. The marketing is irrelevant - using vectorized instructions slowed down the workload in practice which is all that really matters. The Intel teams we were working with explained it as being due to the slower clock speeds caused by vectorized instructions. I don't really know, but it seems fair to assume that they do.

It will be interesting to test Ice Lake when it makes it to the cloud, hopefully some time late next year, but until we can actually use Ice Lake, Skylake is what AVX512 will be judged on.

It's a good thing you measured it :-) Programs that do a little bit of 512x512 FMA mixed in with other stuff will not benefit from AVX-512 but can suffer from the heat it generates, or from the hiccup when the CPU turns the FMA unit on and back off.

Codes that can do a lot of 512b FMA consecutively will benefit very greatly, and pay a small penalty (up to 25%) in terms of throughput for everything else.

Codes that use non-multiplier stuff that's just marketed as AVX-512, like VBMI2, also benefit greatly and without any penalty.
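
To put rough numbers on the three cases above, here is a back-of-the-envelope sketch. It is not from the thread: the function names and the 8x per-clock vector speedup are made-up assumptions, the 0.75 clock ratio is just the "up to 25%" penalty mentioned above, and the model pessimistically applies the lower clock to the whole run rather than only to the stretches where the wide units are active.

```python
def net_speedup(vector_fraction, vector_speedup, clock_ratio=0.75):
    """Speedup relative to all-scalar code running at full clock.

    vector_fraction: share of the original scalar runtime that gets vectorized (0..1)
    vector_speedup:  assumed per-clock speedup of the vectorized part (hypothetical, e.g. 8.0)
    clock_ratio:     clock with AVX-512 active / scalar clock (0.75 = 25% penalty)
    """
    scalar_time = 1.0 - vector_fraction
    vector_time = vector_fraction / vector_speedup
    # Pessimistic assumption: everything runs at the reduced clock.
    new_time = (scalar_time + vector_time) / clock_ratio
    return 1.0 / new_time


def break_even_fraction(vector_speedup, clock_ratio=0.75):
    """Vectorized fraction at which net_speedup() equals exactly 1.0.

    Solves (1 - f + f / s) / r = 1 for f, giving f = (1 - r) / (1 - 1 / s).
    """
    return (1.0 - clock_ratio) / (1.0 - 1.0 / vector_speedup)


if __name__ == "__main__":
    for f in (0.05, 0.30, 0.90):
        print(f"vectorized fraction {f:.2f}: net speedup {net_speedup(f, 8.0):.2f}x")
    print(f"break-even fraction: {break_even_fraction(8.0):.2f}")
```

Under those assumed numbers, roughly 29% of the original runtime has to vectorize before AVX-512 pays for itself. With clock_ratio=1.0 (the no-penalty case) any vectorized fraction is a win, which is why measuring the real workload matters more than the model.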

People with AMD CPUs don't get a choice. Hard to see how this accrues to Intel's mistakes column.

It's not really an Intel mistake, but it is an Intel problem. In ML, the ASICs are coming. NVIDIA is pretty much guaranteed to maintain a leadership position in this space because their software layers are dominant. Intel's ML leadership position is quite tenuous because the killer ML features don't work quite well enough for the premium. MKL should be a solid moat, similar to NVIDIA's CUDA and CUDNN, but if it requires serious effort to get the benefits, it becomes more palatable to spend that effort on ARM-based servers or custom hardware like Inferentia which are meaningfully cheaper. Maybe Ice Lake will fix this, but Intel is running out of time to convince people that Intel chips should remain the first choice in ML.

AMD isn't relevant in this space AFAIK.

Djb’s article linked above answers this in detail. In short, vectorized instructions don’t have to hit ram to do things like addition, but they can heat up the processor beyond its ability to cool to operating temperatures. The clock speed drop is necessary to avoid overheating.

For more reading, check out cpu pipelining, as well as how vectorized instructions actually work. The performance benefit for well implemented vectorized instructions overcomes the clock speed hit by leaps and bounds, which is why mobile systems make such heavy use of them, for example.

FPGAs are in a tough place. Like OP said, most people writing RTL make ASICs, or at least an ASIC that's programmable. The FPGA target market is getting slimmer, since we have programmable ASICs, like GPUs and TPUs, that are as performant and easier to program. They will still serve a niche market, but the "write c++ and run on an fpga" approach will likely never take off.
I thought the main market for FPGAs was that period between "we have a problem that needs custom hardware" and "we have custom hardware being produced at the scale we need". I guess that's a relatively niche market?
There's also the market for high performance things that need to be in-service upgradeable. I understand mobile base stations are a significant customer for large FPGAs, to enable deployment of new standards revisions/modulation schemes without a truck roll.
Pretty much. The problem set they're useful for is low-latency high-throughput stuff, and/or connectivity to high speed digital signals, for things that there isn't an existing custom solution and where you don't care about area or power consumption. That's not a huge market.

We do use them at my employer, a multinational chip company - but only in very small numbers, like one $50k board gets shared around project groups who use it for a few weeks each. Most of the work is done in simulation.

Chiplets are more significant than you credit them for. They allow higher yields and make the production economics much more favorable for AMD, whereas Intel is throwing out a lot more silicon.
Yield might be part of it, but I'm sure intel can ship partially functional chips with a core here/there disabled.

Another of the big advantages for AMD is that their products aren't reticle limited. The basic design lets them have a single design they bolt into dozens of configurations that scale larger than what intel can fit on a single die. Hence 64 "big" cores in a single socket.

There are likely other advantages too (cooling?) that partially make up for the longer more complex core->core latencies.

Do a simulation, chiplets extract multiples of revenue more than binning on failed functional units. Lots of functional blocks are NOT redundant, leading to the total loss of part. At these small feature sizes and massive chip areas, yields are down. Chiplets avoid this.
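
A minimal version of such a simulation, using the textbook Poisson yield model (a die is good only if it has zero defects). This is a sketch under stated assumptions, not anyone's real numbers: the 600 mm^2 total area, the 8-way split, and the defect density of 0.002 per mm^2 are all hypothetical, and the model ignores salvaging partially defective dies, packaging cost, and any extra I/O die.

```python
import math


def die_yield(area_mm2, defects_per_mm2):
    """Poisson yield model: probability that a die of this area has zero defects."""
    return math.exp(-area_mm2 * defects_per_mm2)


def good_silicon_fraction(total_area_mm2, n_dies, defects_per_mm2):
    """Fraction of silicon that lands in defect-free dies when the same
    total area is split into n_dies equal dies (no binning or salvage)."""
    return die_yield(total_area_mm2 / n_dies, defects_per_mm2)


if __name__ == "__main__":
    D = 0.002  # hypothetical defect density, defects per mm^2
    print(f"monolithic 600 mm^2: {good_silicon_fraction(600, 1, D):.2f}")
    print(f"8 x 75 mm^2 chiplets: {good_silicon_fraction(600, 8, D):.2f}")
```

With these made-up inputs the monolithic die comes out around 0.30 and the chiplets around 0.86. How much of that gap binning claws back depends on how much of the big die is redundant, which is exactly the point in dispute here.
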
I don't have anywhere close to enough information to know the actual yield numbers for AMD's products vs Intel's (you can probably count on one hand the number of people who know such things). For sure it's much harder to make a perfect large die, which is part of why most of the 7nm parts are so small (or experiencing really low yields). But the two situations are completely different. Intel is on a very mature process with a larger feature size, and so much of their large-die chips _ARE_ consumed by things that can be disabled (cache slices, cores, etc.) that the probability of a defect landing on some critical portion of the die and completely junking it is probably fairly low. Otherwise we would be seeing a glut of the lower core count parts too, and Intel doesn't really seem to be having a problem sourcing the upper mid-range Xeon parts.

Bottom line, I don't believe that Intel's product line prices in any way reflect what the actual yield curves are.

Chips with problematic cores are sold as lower end chips. For the same production cost, you are getting less revenue - failure rate plays a big role in profit margins.
Vs throwing the whole die away because you don't sell enough systems that small?

It's hard to tell, but Intel still has a strong markup on 24 core parts being sold from 28 core dies. Intel has often been "caught" down-selling parts to protect their higher margin parts. (AKA they are selling parts with things disabled that work)

They weren't "caught" - binning is a common practice in the CPU industry. This isn't a problem.
I wasn't talking about binning, I was talking about when you have binned at a certain level, but the product is sold under its capability because you want to maintain the illusion of scarcity of the better parts.

AKA it's a perfect part, but it's being sold with a couple of cores disabled or at a frequency below what it's capable of.

Ah yes the anti-vexxer argument. A good overview on AVX-512 and the criticism can be found here: https://blog.cr.yp.to/20190430-vectorize.html
> AVX-512 debuted on Xeon because datacenter operators asked for it.

It debuted on workstation accelerator cards.

> It does not “downclock a whole chip”, it gates the core where it is active and there’s not even that penalty on the current generation parts.

It very much could thermally throttle more than the one core.

> “10nm” is marketing fluff which has little or nothing to do with actual semiconductor construction.

"10nm", even as a proper noun, is a very important component of Intel's woes right now. They aren't getting the yields they were expecting, a major competitor (TSMC) surpassed them for the first time ever, and that's how AMD is killing them right now.

> “Chiplet” is also marketing-speak for “wow this memory topology is hard to program around”. Not sure they should feel too bad about missing that boat.

No, it's marketing-speak for "near EUV process nodes have terrible yields compared to previous nodes, and need smaller dies combined on a multi chip module to get anything worthwhile for an acceptable cost". Current EPYC chips are a single NUMA node again, but still chiplets. Intel is absolutely kicking itself for not bucking the trend and going chiplet, because then they would have been competitive with TSMC for yield/area. A single chip is putting all your eggs in one basket, but splitting the dies means you throw away far fewer chips. (Another way out is FPGAs and GPUs, which can practically bin off way more of the chip.)

> And the bets they made didn’t pan out: FPGAs aren’t popular because the people sophisticated enough to use them are also smart enough to tape out ASICs.

FPGAs are very interesting in a post Moore's law world. Their ability to dynamically reconfigure makes them interesting in cases where ASICs don't make sense. High level logic can be treated like code from a continuous delivery perspective (like Alibaba does with their memcache like FPGAs sitting on RDMA fabric). Data can be encoded in combinatorial logic and treated like any other infrastructure deployments (like Azure does with their routing CAMesque logic in their SDN FPGAs). ASICs don't give you anywhere near that flexibility, even in a world where they're a commodity. Don't confuse their tooling immaturity for a lack of usefulness.

> IoT is not a thing.

It's very much a thing; once again just an extremely immature ecosystem. Once high end CPUs are commodities that can be shopped around from each of the fabs, IoT external customer designs will almost certainly be a very important revenue stream for Intel. A modern fab is nothing to sneeze at, basically only countries with $20B to spend will have one, so we'll be seeing one or two per continent. It won't make sense for anyone else in the US to compete. As for how that affects IoT, tiny nodes will be amazing for little smart dust chips once the capital investment of these end nodes has been paid off.

IoT is a thing .. that Intel failed to get into. They don't have anything that scales down that well. That market is dominated by ARM, implemented by all sorts of lower tier vendors like MediaTek.

FPGAs really need a tooling unlock to take off so they can be useful to people who haven't been on the ASIC design course.

> It won't make sense for anyone else in the US to compete

TSMC?

TSMC isn't a US company.
Correct, but they do a lot of fab work for other companies. It's not absolutely necessary to have your own fab to be competitive unless your volumes are huge... at which point you can afford it.

Apple have $245bn on hand, so they could have a dozen $20bn fabs if they felt the need. Bezos has $180bn and no idea what to do with it.

https://www.cnbc.com/2019/01/29/apple-now-has-tk-cash-on-han...

They could, but the answer to "what should I spend $20b on" for both pretty clearly isn't "a fab where at the end of the day you might get a fraction of a percent better costs on bare dies from the vertical integration versus just buying from TSMC or whoever"

Fabs are both a commodity, and take a high capital investment. Unless you have a geopolitical reason to create one and you get .gov kickbacks to make it happen, then there's way better uses of your money.

Thank God for this comment. I couldn't upvote this enough; it went into much more detail than I could have bothered to reply with.

Second comment on the page, and literally everything said in it was wrong.

Basically all that stuff you said about IoT has been said verbatim for decades and yet here we are. Remember the "SmartMote"? Neither does anyone else. By the way that was _also_ an Intel-funded project.