by joncfoo 1200 days ago
At $previous_job we shifted a large workload from Intel to Graviton which was projected to save ~$1.7m annually while keeping roughly equivalent performance (after some tuning).
2 comments

> which was projected to save ~$1.7m annually

Did it?

I've seen first-hand validation on massive workloads moving to Graviton-based instances. This includes low-latency, high-TPS Java services and offline big data compute on EMR.

All combined, the hype is quite real. Heck, even moving an Intel-based service to newer Nitro-based EC2 instances resulted in a drastic performance improvement. Moved from m5.24xlarge --> m6g.8xlarge with better service performance and improved latency characteristics. Intel is in trouble in my opinion.

m5 instances use the Nitro system. In addition, m5.24xlarge is quite a quirky instance type: it uses 2 CPUs with 24 cores each in a NUMA configuration. Half of the RAM is attached to each CPU, and access from the other CPU is much slower. In addition, the CPU cores use a microarchitecture from 8 years ago, meaning the cores are quite slow in practice.

All of this means that a lot can go wrong when running code on those instances, resulting in lower performance. It is advised either to run separate processes on each NUMA domain or to use NUMA-aware code (which Java almost never is). In addition, the code (or the system) should scale well across many CPU cores.
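A rough sketch of the "separate processes per NUMA domain" idea on Linux, in Python. It assumes the usual sysfs layout (`/sys/devices/system/node/node*/cpulist`); the helper names `parse_cpulist`, `numa_nodes` and `pin_to_node` are made up for illustration, and the sample cpulist in the docstring is hypothetical, not taken from an actual m5.24xlarge.

```python
import glob
import os


def parse_cpulist(text):
    """Parse a Linux cpulist string such as "0-23,48-71" into sorted CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Return {node_id: [cpu ids]} from sysfs; empty dict if the path is absent."""
    nodes = {}
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*", "cpulist")):
        # The directory name is "node<N>"; strip the prefix to get N.
        node_dir = os.path.basename(os.path.dirname(path))
        node_id = int(node_dir[len("node"):])
        with open(path) as f:
            nodes[node_id] = parse_cpulist(f.read())
    return nodes


def pin_to_node(node_id, nodes):
    """Restrict the current process to the CPUs of one NUMA node (Linux only)."""
    os.sched_setaffinity(0, nodes[node_id])
```

A worker started once per node could call `pin_to_node(n, numa_nodes())` before doing any real work. Note that this only sets CPU affinity; memory placement is left to the kernel's default policy. `numactl --cpunodebind` together with `--membind` is the more explicit way to bind both.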

In addition, the cores are old enough to suffer from Spectre/Meltdown-related patches and workarounds, which hurt syscall performance in particular.
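One way to see which of these mitigations a given instance is actually running with is to read the kernel's report under `/sys/devices/system/cpu/vulnerabilities` on Linux. A minimal sketch, assuming that directory exists; the function names are made up for illustration.

```python
import os


def read_mitigations(root="/sys/devices/system/cpu/vulnerabilities"):
    """Return {vulnerability: status line} from sysfs; empty dict if unavailable."""
    status = {}
    if not os.path.isdir(root):
        return status
    for name in sorted(os.listdir(root)):
        try:
            with open(os.path.join(root, name)) as f:
                status[name] = f.read().strip()
        except OSError:
            status[name] = "unreadable"
    return status


def needs_mitigation(status_line):
    """True unless the kernel reports the CPU as "Not affected"."""
    return not status_line.startswith("Not affected")
```

Printing `read_mitigations()` on the old and the new instance type shows which entries went from a "Mitigation: ..." line to "Not affected". It does not tell you how much each mitigation costs; that still needs a benchmark.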

In our case the instance type is about the only workhorse for the given job. It is high TPS (scales well with the core count) and needs a large on-disk configuration for low-latency key-value retrieval of data deployed on disk.

Having seen your reply, I did slightly misspeak about the instance move. We moved from m5.24xlarge to m6i.16xlarge. Sorry for the confusion.

That said, you shared some interesting information. I'd love to read up more on this. Is there any specific place where I can dig a bit deeper into the finer specifics of these instance types and their architecture?

Just to note: m6i instances are Intel-based.

As for getting information on AWS instances, the best way in my opinion is just to spin up the instance and look up which exact CPU model it uses. Then you can go to, for example, WikiChip (https://en.wikichip.org/wiki/WikiChip) to see more information about the CPU. Other good sources include AnandTech (for example https://www.anandtech.com/show/15578/cloud-clash-amazon-grav...) and Chips and Cheese (for example https://chipsandcheese.com/2022/05/29/graviton-3-first-impre...).

Things like NUMA configuration can be inspected with tools like numactl.
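For the "look up which exact CPU model it uses" step, a small sketch in Python that pulls the model string out of `/proc/cpuinfo` on Linux. The function name `cpu_model` is made up for illustration. On x86 the field is called `model name`; ARM kernels often don't expose that field at all, in which case this returns `None` and `lscpu` is the better tool.

```python
def cpu_model(cpuinfo_text):
    """Return the first 'model name' value from /proc/cpuinfo text, or None."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name":
            return value.strip()
    return None
```

Usage on a running instance would be something like `cpu_model(open("/proc/cpuinfo").read())`, and the resulting string is what you would search for on WikiChip.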

Yes, I'm aware. The service in question couldn't easily be moved, so we moved to m6i, which isn't ARM-based but does leverage Nitro. We saw substantial improvements in that configuration too. I'm not sure what is different, since you said m5 uses Nitro as well, but my assumption was that reduced hypervisor overhead from Nitro on m6i was why we saw the improvement.

> Moved from m5.24xlarge --> m6g.8xlarge with better service performance and improved latency characteristics. Intel is in trouble in my opinion.

I wonder if this is actually an Intel issue or if there are some other optimizations at play, such as in the virtualization layer.

Because at one point I wanted to try JetBrains' new "Gateway" product, which basically runs a remote IDE and only shows the GUI locally. I was curious on one hand, but I also wanted a machine with a bit more oomph for my occasional compilation needs (Rust on Linux, fwiw). I was really unimpressed: the c6i was comparable to my local slim laptop running an 11th-gen i7 U-series part. My similar slim AMD 5650U laptop is actually faster. IIRC, the c6i.metal wasn't particularly faster on this kind of single-threaded work.

The difference is in the pricing and the fact that the cores are "whole".

On Intel instances on AWS, you pay per hyperthread. On Graviton, you pay per core.

But on this kind of workload and with modern schedulers, the HT bump is rather limited. So in practice you are paying twice the price for the same number of cores.

This is the biggest contributing factor to that difference, and I keep being surprised that no one mentions it.

Not sure what you're talking about. On AWS x86 you pay by core (well, as much as you pay by core with ARM anyway; you can't just buy a 1 GB server with 64 cores).

AWS x64 'cores' are the virtual cores you see on hyperthreaded CPUs, and they map 2:1 to physical cores on the CPU. The AWS ARM offering doesn't have hyperthreading, so its virtual cores map 1:1 to CPU cores.

You can disable hyperthreading on the x64 instances at the cost of halving the number of cores you have available in the instance that you paid for.
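The arithmetic behind the "paying twice the price for the same number of cores" claim can be sketched as follows. The prices and vCPU counts in the usage lines are hypothetical round numbers chosen to make the ratio obvious, not real AWS pricing; the only inputs taken from the thread are 2 threads per core on the x86 instances and 1 on Graviton.

```python
def price_per_physical_core(hourly_price, vcpus, threads_per_core):
    """Hourly price divided by the number of physical cores behind the vCPUs."""
    physical_cores = vcpus / threads_per_core
    return hourly_price / physical_cores


# Hypothetical: two instances with the same price and the same vCPU count.
x86 = price_per_physical_core(1.00, vcpus=32, threads_per_core=2)
arm = price_per_physical_core(1.00, vcpus=32, threads_per_core=1)
print(x86 / arm)  # 2.0
```

Whether that factor of two shows up in practice depends on how much the workload gains from hyperthreading and on the actual price gap between the instance types being compared.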

"Intel is in trouble" since the Calxeda days and ARM is still insignificant to this day.

> ARM is still insignificant to this day.

Is it? The phone you use probably uses ARM. If you buy a Mac now, it's probably gonna be ARM. It's very much different from the Calxeda days!

OnlineOrNot (my company) saved about 30% moving DBs from Intel to ARM, so it sounds legit.

1.7 million may be a lot or may be a tiny drop depending on your overall (like for like) spend. The % saved is the important metric here, not the aggregate dollar amount.

Did you replace 5m of EC2 with 3.3m of EC2 and save 1.7m (impressive) or did you replace 50m of EC2 with 48.3m of EC2 (not really so impressive)?
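Worked out with the two scenarios from the comment above (amounts in millions of dollars; the function name is made up for illustration):

```python
def savings_fraction(before, after):
    """Fraction of the original spend that was saved."""
    return (before - after) / before


print(f"{savings_fraction(5.0, 3.3):.1%}")    # 34.0%
print(f"{savings_fraction(50.0, 48.3):.1%}")  # 3.4%
```

Same $1.7m in both cases, but a tenfold difference in the share of spend it represents.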

> or did you replace 50m of EC2 with 48.3m of EC2 (not really so impressive)?

I'm failing to comprehend how that's not impressive. Bean counters would still love this type of savings.

The same fixed amount of money is less noticeable in a larger organization. It's more likely that something more effective could have been done with the engineers' time.

On the other hand, if the larger organization hired an AWS specialist, as many do, the optimization might be "free" because the specialist wouldn't have been effective outside of their area.

I'd say the exact value can be more meaningful than the % for many companies that aren't at the billion-dollar scale.

Understanding you saved $1M/yr means you're closer to profitability (I understand in the VC world all you really care about is % growth, but that's up for debate if that's how everything should be) or able to hire more engineers.