by mktmkr 2506 days ago
The historical context here is that AMD once had a monopoly inside Google's datacenters and pissed it away by shipping the horribly broken Barcelona followed by the not very broken, but also not very fast, Istanbul. This is a return to form for them, after a decade of poor form. The important thing for an operator like Google or Amazon is pricing power. As long as they can brandish a competing platform under the nose of Intel sales reps, they can get a better deal from Intel regardless of which they really prefer. You may have noted that a few years ago Google was showing a POWER platform at trade shows. That has the same purpose of putting Intel (and AMD) on notice that they have the capability to port their whole world to a different architecture if needed.
6 comments

Sorry, it is just not true that Google was AMD exclusive. During that time Google had an I/A release cycle where every other hardware platform cycled between Intel and AMD. I will give you that AMD had a huge issue launching, and I still have my "Argo" bag that I got for helping with the hardware qualification trials, but Google was in no way AMD exclusive before that.
You're right. There were also not a lot of AMD boards at first, to the point that services had to prove "Opteron worthiness". Unless your load tests showed improvements of 70% or more, presumably what the most important products were seeing, you couldn't run on them. Those were the days. The next generation Intel systems weren't a great improvement, but they still ended up being built in large numbers.
> Google had a I/A release cycle where every other hardware platform cycled between Intel and AMD.

Was the I/A cycle intended just to keep Intel and AMD on their toes, knowing that Google's data centers and software seamlessly supported both platforms? Was there any technical benefit, other than ensuring Google's software was portable?

There was a delivery benefit. At the time we were one of the largest server manufacturers by volume. Generally we were on the same scale as Dell or HP. Buying from only a single vendor had serious cost implications, sure, but there was also a pure volume issue. If Intel couldn't get us 50k of a given chip fast enough, we could fill capacity needs with an AMD platform instead. That was the goal at least.
AMD is not just competitive, it is better than Intel. Thus Google should adopt it and role it out, and faster than any other cloud provider. This will win them customers. I want 256 threads per machine at competitive prices.
AWS has already rolled out EPYC instances.

https://aws.amazon.com/ec2/amd/

Those are the previous generation, Zen 1 EPYC CPUs, which were rolled out on AWS back in November.
Don't forget power consumption. Electricity costs are probably just as big a factor as performance when it comes to the number of computers Google has.

IBM quoted them as willing to switch to POWER if they could save 10% in energy costs

AWS has AMD machines. Amazon claims they perform worse than the Intel machines they use, and they are priced lower as a result.
T3 are Intel, newer than the T2, and perform worse than the T2. T3a are AMD and perform on par with or slightly better than T3 for less cost. (From my own testing. Not a claim I can back up; this is just my observation.)
I thought they were cheaper due to the better energy efficiency. Less electricity means less cooling required, double whammy!
Where did you hear that? We're running a handful of m5a instances with fantastic performance. I figured they were priced lower because they're cheaper to purchase and operate.
Depends on the use case. Intel is still top for many games and some apps like Excel/Photoshop.
"roll"
Very few developers are prepared to write code that can efficiently use 256 threads / machine. At that level, cache coherency becomes a real and non-trivial problem.

In most cases, I suspect developers will see improved wall-clock times with substantively worse FLOPS/watt. Good for developers, bad for data-centers.

«Very few developers are prepared to write code that can efficiently use 256 threads / machine»

This junk justification has not been relevant for years. Most developers don't care because (1) they rely on core applications that are already multi-threaded (web servers, SQL engines, transcoding, etc), or (2) in today's age of containers, VMs, etc, it doesn't matter to them. Now we scale by adding more containers and VMs per physical machine. Bottom line, data centers always need more cores/threads per machine.

Correct, if you partition a 256 core machine into 32 virtual 8-core machines partitioned by their NUMA architecture - you are relatively unaffected by core count (minus the consequence of some scheduling algorithms not tuned for N > 8).

Unsure what percentage of VMs use no time sharing or oversubscription, though.
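
To make the 256-into-32 split above concrete, here is a minimal sketch. It assumes core IDs are numbered so that each contiguous block of 8 sits on one NUMA node, which is not guaranteed on real hardware; the actual topology should be read from the OS. `partition_cores` is a hypothetical helper name, not an existing API.

```python
def partition_cores(total_cores: int, cores_per_vm: int) -> list[range]:
    """Split core IDs 0..total_cores-1 into contiguous, equally sized groups."""
    if total_cores <= 0 or cores_per_vm <= 0 or total_cores % cores_per_vm != 0:
        raise ValueError("total_cores must be a positive multiple of cores_per_vm")
    # Assumption: contiguous core IDs share a NUMA node (check the real topology).
    return [range(start, start + cores_per_vm)
            for start in range(0, total_cores, cores_per_vm)]


groups = partition_cores(256, 8)
print(len(groups))      # 32
print(list(groups[1]))  # [8, 9, 10, 11, 12, 13, 14, 15]
```

On Linux, a process could then be pinned to one group with something like `os.sched_setaffinity(0, set(groups[1]))`, though a hypervisor would normally handle this placement itself.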

Most devs I know are creating async workloads which don't require cache coherency, as they use parallelism to process separate requests and workloads in parallel. I can see things being pretty linear in that sort of space.
They are not linear unless all requests take an identical amount of time OR the system is not oversubscribed (common in many workloads) - and even then, the current Linux CFS scheduler has a complexity of `O(log N)`.

When you have variable length requests, you will find cores will not always be balanced, it is simply a statistical reality. And in those cases, the kernel will have to migrate your process to a different core, and if you have 256 cores, that core might be really far away.
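
The imbalance point can be illustrated with a toy simulation. This is only a sketch under stated assumptions: request durations are drawn from an exponential distribution, each request lands on a uniformly random core, and there is no migration or work stealing, so it is not a model of any real scheduler. `simulate_core_loads` is a made-up name.

```python
import random


def simulate_core_loads(num_cores: int, num_requests: int, seed: int = 0) -> list[float]:
    """Assign variable-length requests to random cores; return busy time per core."""
    rng = random.Random(seed)
    loads = [0.0] * num_cores
    for _ in range(num_requests):
        duration = rng.expovariate(1.0)  # assumed distribution, mean duration 1.0
        loads[rng.randrange(num_cores)] += duration
    return loads


loads = simulate_core_loads(num_cores=256, num_requests=256 * 20)
mean_load = sum(loads) / len(loads)
print(f"mean busy time per core: {mean_load:.2f}")
print(f"busiest core: {max(loads):.2f}, idlest core: {min(loads):.2f}")
```

The gap between the busiest and idlest core is the work a scheduler would have to rebalance by migrating processes, which is where distance between cores starts to matter.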

Except that they are typically not. The Zen architectures are NUMA and controlling where memory is allocated is key to decent threaded performance. You may even have to do seemingly counterintuitive things like duplicating central data structures across nodes and other tricks from the distributed systems playbook.
Epyc 2's memory layout is not like Epyc 1. Epyc 2 is very simple.
Yup everything is equally slow now. Kinda sad, but the original NUMA design was treated as a glass half empty situation instead of AMD letting people maximize performance. This change lets them avoid the bad press and everyone is happier despite the final design being slower than it could have been.
I suspect cache coherency doesn't mean what you think it means. It's a hardware feature.

But yes, writing correct and performant highly parallel code is difficult & error prone, often prohibitively so.

Must be strange to work on a huge technical project knowing your work will likely never be used and is there primarily to put pressure on someone else to lower their prices.

I guess you get paid, and can think of it like a hobby project for your own technical chops. But still.

This is my favorite kind of work, because there's no chance you'll ever get called by an irate customer.
Been there, done that.

"Upper management want us to be able to offload burst capacity to AWS, MS, Google or other public provider, do what you can to make it work but I reckon in-house can beat them on pricing"

Six months later -

"Congrats, good work! We showed them, we're getting a new data centre!"

A lot of people love working on more esoteric technical things and get more satisfaction from the intellectual component than its direct utility (I certainly feel this way about some things, though trans-architectural portability is not one of them). I would imagine this type of person is better-represented in this sort of field.

In addition, this kind of work does indirectly help keep Intel competitors viable, which helps keep Intel in check for everyone. Stuff like that is pretty exciting in its own way.

It’s an interesting concept to me. I do have a lot of “hobby work”, which is still meant to be used eventually. I just don’t apply a timeline, which enables me to focus on correctness.

Then I have my professional work, where the timeline is the primary focus, and correctness can only be pursued where it moves the timeline forward.

These projects you’re talking about are an interesting mixture. There is still a timeline that must be hit, because you need to do your demos, and you need to be ready to shift to a production timeline if negotiations go south. But since there are no customers the business model isn’t changing. And you don’t need to do any polish. So you can stay focused on the raw architectural problems.

It’s like my hobby projects in that you can focus on readiness over completeness, but there still is some timeline pressure.

Interesting to think about. Strange for me, but I guess it’s every day for others!

Just imagine how nice a project is when you don't have to worry about supporting it for years.
Projects exist to expand revenue or cut costs. In my experience the revenue expansion ones are far more likely to “never be used”
Tons of neat stuff gets built and then not used for a number of reasons (political, pricing, scaling, etc). It doesn't mean building it isn't worth it though.
You're right about pricing power. I think the net benefit from new competitive AMD chips will be to force Intel to adjust its premium pricing. Personally, when I'm shopping for my own needs or pricing out a build for work (basically not quite "big" data, but data about as big as can be done on a single high-end workstation), I don't really care about brand. I care about cost-performance factors and component compatibility. I'll happily choose AMD if they're at a 20% discount to Intel.
I think STH's writeup does a particularly good analysis here: https://www.servethehome.com/amd-epyc-7002-series-rome-deliv...

> The second is important. Customers need to adopt AMD EPYC. To our readers, it is important when you get a quote to at minimum quote an AMD EPYC alternative on every order. More important, follow through and buy ones where Intel is not competitive. If AMD EPYC 7002, with a massive core count, memory bandwidth, PCIe generation and lane count, power consumption, and pricing advantage cannot take significant share, we are basically done. If AMD does not gain enormous share with this much of a lead, and easy compatibility, Intel officially has a monopoly on the market and companies like Ampere and Marvell should shut down their Arm projects. If AMD does not gain significant share, there is no merit to having a wholistically better product than Intel.

As for bettering cost-performance, the full review gives plenty of context that the new Epyc 2s soundly beat the current Intel Xeon lineup (often by 2X or more), but I think AMD is also doing what they need to do to get market share (while still raising their ASPs):

> When it comes to the top-bin SKUs, the value proposition is simple, just get a higher-end SKU and consolidate more servers to save money. AMD is extracting value for the higher-core count SKUs. For AMD a chip with 64-cores, 256MB L3 cache, 128x PCIe Gen4 lanes at just under $7000 compares favorably when its nearest Intel Xeon competitors are two Intel Xeon Platinum 8280M SKUs (M for the higher-memory capacity) that run just over $13,000 each. AMD at around $7000 is essentially saying Intel needs to start their discounting at 73% to get competitive, and that is not taking into account using fewer servers.

> On the AMD EPYC 7702P side, AMD is calling Intel that if it wants to be performance competitive, it needs to discount two Platinum 8280M’s by 83% plus the incremental cost of a dual-socket server versus a single-socket server. This is a big deal.
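
The percentages in the quote can be checked with a little arithmetic. The sketch below uses the rounded prices from the quote itself ($7000 and $13,000), so the results are approximate; `required_discount` is just an illustrative helper, and the 7702P figure at the end is derived from the quoted 83%, not taken from a price list.

```python
def required_discount(amd_price: float, intel_price_each: float, intel_chips: int = 2) -> float:
    """Fraction Intel would have to discount its chips to match the AMD price."""
    return 1.0 - amd_price / (intel_price_each * intel_chips)


# Rounded figures from the quote: ~$7000 for the 64-core part, ~$13,000 per 8280M.
top_bin = required_discount(7000, 13000)
print(f"{top_bin:.0%}")  # 73%

# Price implied by the quoted 83% for the 7702P (derived here, not a list price).
implied_7702p_price = (1.0 - 0.83) * 2 * 13000
print(round(implied_7702p_price))  # 4420
```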

What was horribly broken in these Opterons?
The Barcelona chips initially had a pretty nasty bug in the TLB. AMD stopped shipments for about 5 months so they could put out a new stepping with the bug fixed. The Istanbul chips arrived a few months after Intel's Nehalem, which is where Intel caught up with features like the on-die memory controller and started roughly a decade of unchallenged performance lead.
The TLB bug (Errata 298, doc 41322 if you really care - while the processor was attempting to set the A/D bits in a page table entry, an L2->L3 eviction of that PTE could occur) was one of a great many things wrong with that chip.

* A number of errata (not just 298) delayed full production, sapped performance, or negatively impacted idle power. Take a look at doc 41322, DR-BA step for many samples.

* It was late and didn't achieve performance targets; it missed clock rate targets and 2 MiB L3 was insufficient.

* Intel delivered a very compelling server part (Nehalem) during the lifecycle of family 10h.

How do you measure performance per watt? FLOPS/watt? I don't think FLOPS is a worthy measure of chip performance, since it doesn't take the L1/L2/L3 cache size into account.

Are there performance benchmarks that are designed to measure server application performance?

There are plenty of server performance benchmarks and even one for power efficiency: https://www.spec.org/power_ssj2008/

Google probably has a whole team internally to benchmark their own applications on different hardware.
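
For a rough idea of what an application-level efficiency score looks like: as far as I understand it, the benchmark linked above reports an overall operations-per-watt figure computed as total throughput across several load levels divided by total average power across those levels plus active idle. The sketch below follows that shape; the measurements are hypothetical, and `overall_ops_per_watt` is an illustrative name rather than anything from the benchmark's tooling.

```python
def overall_ops_per_watt(ops_by_level: list[float],
                         watts_by_level: list[float],
                         idle_watts: float) -> float:
    """Sum of throughput over load levels divided by summed average power, idle included."""
    if len(ops_by_level) != len(watts_by_level):
        raise ValueError("need one power reading per load level")
    return sum(ops_by_level) / (sum(watts_by_level) + idle_watts)


# Hypothetical measurements at 100%, 50%, and 10% target load (illustrative only).
ops = [1_000_000.0, 500_000.0, 100_000.0]
watts = [300.0, 200.0, 120.0]
score = overall_ops_per_watt(ops, watts, idle_watts=80.0)
print(round(score, 1))  # 2285.7
```

Because the numerator is measured application throughput rather than peak FLOPS, cache size and memory behavior show up in the score indirectly.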