by c7DJTLrn 1295 days ago
Cloud cost optimisation is underrated. In the companies I've worked in nobody has really given a shit (at least not under normal economic circumstances). In the industry there's a strong avoidance of ARM compute instances for no good reason. If I were building from scratch today I would definitely go with Graviton.
14 comments

At $dayjob I found an unused box in the cloud running an expensive database engine. It was idle for months, created to be used by a consultant on a project that had wound up. The consultant had quit his consultancy on top of this.

I was told in no uncertain terms not to even think of touching this VM because “the budget has been approved”.

I was shocked at the flagrant waste of money and assumed it was a one-off aberration.

Nope, for months afterwards I kept hearing the same refrain from manager after manager, from product owners and dev team leads.

“Don’t touch! We fought hard for this budget! You’ll take it from our cold dead hands!”

Eventually I soured on the whole idea of cloud cost optimisation as a service for unmotivated third parties and gave up on the whole notion.

FWIW, in these situations you're better off proposing:

"I'm going to reuse this VM, to help our ... fleet scale better."

That way your management continues to use their allocated budget, and your real prod systems work slightly better (also will eventually require less additional $ to scale up - helping the company i.e. shareholders).

The thing to remember:

You would assume that all middle management really manages is a top line and a bottom line. Numbers related to their KPIs/OKRs are roughly a top line, and numbers related to their resources (humans and cloud infra budget) are roughly their bottom line.

The reality: Middle management's resources (humans and cloud infra budget) are not their bottom line. Middle management gets rewarded (promoted) when they have "enough scope", scope has roughly always been defined by number of people (it now also includes things like cloud cost budget). As such middle management has to say "we need to do more with less", but they are promoted based on these numbers going up!

Is this reward structure in the best interest of companies (i.e. customers and/or shareholders)? No, neither. Is there a better system? Not yet. Is the reward structure created by middle management for middle management? Likely.

So in the meanwhile, if you don't want to become unmotivated, might as well work within the current reward structures.

> help our ... fleet scale better."

Heh, I tried that too! I found a lot of setups using old HDD disks that were at 100% of their IOPS limits and CPUs that were idle. When they were built, Azure didn't have Premium SSD, so that's forgivable. I offered to rearrange where the expenditure goes, such that they have newer and faster CPUs, Premium SSD, but fewer cores which means a reduction in licensing costs. Cost neutral while improving their performance and capacity.

That got a very firm "Nope! Nope! Nope!" from most (but not all) teams as well.

> As such middle management has to say "we need to do more with less", but they are promoted based on these numbers going up!

I came to the same conclusion. Many project managers or product owners introduced themselves in the first meeting by proudly proclaiming the huge size of their operation. I.e.: "The system I'm responsible for is a multi-million dollar project with a small army of developers!"

I've also noticed that there's a tendency to exponentially blow out complexity of what ought to be a trivial system for the same reason. Something that could be static HTML on an S3 bucket or Azure Storage Account turns into a microservices monstrosity draped across three clouds and four external SaaS services.

Those resumes don't pad themselves.

It’s as though they interpreted David Graeber’s Bullshit Jobs as a managerial playbook
Clever managers will repurpose that box.

One of my managers was a champion at that, taking scraps from everywhere for undercover projects. One of the few who managed to get things done in a highly bureaucratic company. Also helped by the fact he is good at picking competent people.

Imagine seeing your startup grow into a company where bureaucracy rewards department heads who waste money now to protect their budget so they can keep wasting it next year...
I think the main reason is "I want to run the same binaries locally that I run in the cloud," and it's a pretty valid one. However, it's also an expensive one sometimes.
Anecdotally, this is starting to shift with M1 MacBooks: Graviton is looking more attractive, for precisely that architecture parity reason, to teams using mostly M1 devices.
Yeah if only. Our ops people are too uneducated to be able to deploy anything Apple. Literally there are armies of factory pressed Windows monkeys but nothing in the Apple space.

Note to apple: please start concentrating on the enterprise sector. We're dying over here. My Dell weighs 3x my personal M1 MBP, has a shitty keyboard with keys designed for Borrowers, the battery lasts 8 minutes and it reduces my sperm count if I put it on my lap. It feels like I have a ball and chain around my ankle 24/7. My only escape is WSL2 which is broken as fuck as well (can't run services, cron jobs, X problems etc) and we can't install a simple non-WSL VM on the node because Device Guard requires hyper-v to be enabled excluding sensible and pure VM options like VirtualBox. Docker for windows is a comedy of errors too.

For what it's worth, I work in a large 30,000 employee company. Everyone gets a Windows machine by default.

6 years ago our department of 200 people "went rogue" and started provisioning Macs, because it was the only way we could hire developers and provide a good developer experience for the work we were doing.

This took some convincing, but it was possible. We agreed to be unsupported by the in-house help desk, but we had 2 people in IT that supported us for provisioning and fixing machines, and sorting out a small amount of required enterprise software like a Cisco VPN client and some fleet management background agents.

Otherwise, we self-organized support over Slack and in-office and also made use of Apple's business support directly.

As of earlier this year, our department is now over 600 people, and we've given our internal IT enough incentive to officially support Macs, which they now do, alongside Windows.

They use some kind of MDM software to manage and update and monitor our Macs the same as they do Windows.

There are also now additional much larger teams in the organization exploring Mac adoption where it makes sense for their developers too, and we could soon have thousands of Macs in use.

So it's definitely possible, even if you have to start small.

Unfortunately we're in a regulated industry so getting a PO signed off for one Mac is nigh on impossible without involving the entire corporate machinery.
We are also a highly regulated industry. Not finance but pretty close.

I sympathize with you though, it sounds like there's no will to do it, which sucks :(

Are your developers forced to use laptops?
Our company issues laptops yes, because everyone can choose to either work at home or in the office.

That said, we also supply external monitors, keyboards and mice for anyone who wants one, and every worker has a discretionary budget to spend on home office accommodations (chair, standing desk, etc).

I also suspect if a developer specifically asked for a desktop computer and could (lightly) justify the need for it, we would get them a desktop, and a laptop.

Unfortunately yes. I would rather a desktop but they don't know how to pay half as much for the same specification. The desks in the office are all equipped with docks and expensive WiFi mesh driven by COVID mentality so that is the status quo.

Just send me a fucking workstation. Nope too hard.

Do you propose they carry desktops back and forth from their office and to conference rooms?
> Note to apple: please start concentrating on the enterprise sector. We're dying over here.

Enterprise sales are where the customer is not the user. Apple does best when the user is the purchaser.

Also, I know for a fact that Macs are well supported at scale by many large tech companies including my own.

FYI, with WSL2 1.0, you can finally enable systemd, so you can run services and cron jobs.
Thanks for the tip off. I will look into this tomorrow. This is why I'm here. The distributed consciousness of HN is a wonderful problem solving engine :)
Never used Docker for Windows, but Docker for Mac has not been great lately either.

Granted, macOS changes a lot between each release but our company is paying for Docker Desktop licenses and the experience has really been disappointing.

Can you use a plain Hyper-V VM? i.e. with Hyper-V Manager
Tried that, but unfortunately there are some painful addressing and routing issues that you are subjected to when dealing with a corporate always-on VPN. Ergo you can't actually contact the clusters which you have to admin via kubectl.
What's stopping you from running kubectl.exe or Cygwin? Frankly, I still think Cygwin's better than WSL in many ways.
That’s the transition we went through. Our dependencies are / were pretty weird so the transition took a bit of effort - I suspect more complicated than many people would have to go through.

We all use Macs at work so we knew it was a matter of time before we were on ARM. I’m glad we made the transition. M1 airs are a delight to work with and Graviton machines are great bang for buck.

Valid why? Do you not trust compilers? Is it infeasible to (at least occasionally) run automated tests on cloud instances? Personally I've been pretty used to quite significant differences between local and production environments - it's rarely an issue, and I don't remember CPU architecture ever being one. Things like timezones or firewall restrictions/network differences (including talking to 3rd party APIs with IP whitelisting) are far more likely to cause problems.
Completely agree... the only exception I've run into is that for small operations build tooling often doesn't work well with arm64.

EG: GitHub actions can build a container in a few minutes in x64 or 35 minutes in arm64... likewise aws-cdk literally could not run an arm64 fargate ecs deployment for months after support was added (They simply did not support the required attribute in the container definition).

I would love to see this change as I've had nothing but great experiences with graviton for virtually anything arm supported.

Are you building on arm64 natively or via qemu? A few mins vs 35 for roughly the same spec of CPU seems a bit off, even with optimisation considerations.

I've found arm64 builds on amd64 take longer when using one build context/arch (but doing multiple platforms), but that's because it's being emulated.

It's the opposite on my M1, the buildx amd64 takes longer.

Oh, it was with buildx (which uses qemu), as GitHub Actions run on x64. I was showing a specific example of the arm64 build tooling challenges small startups encounter (GitHub Actions' lack of arm job runners in this case). My arm64 builds on arm64 architecture scream.
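
For anyone trying to work out which case they are in, here is a minimal sketch of the "native or emulated?" check being discussed. It only compares the target architecture with what Python reports for the host; the helper names and the alias table are my own assumptions, and it does not inspect Docker or buildx in any way.

```python
import platform
from typing import Optional

# Common spellings of the two architectures discussed in this thread.
# This alias table is an assumption for illustration, not an exhaustive list.
_ALIASES = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "x64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def normalize_arch(name: str) -> str:
    """Map an architecture spelling to 'amd64' or 'arm64'.

    Unknown names are returned lowercased and otherwise unchanged.
    """
    key = name.strip().lower()
    return _ALIASES.get(key, key)


def needs_emulation(target: str, host: Optional[str] = None) -> bool:
    """Return True when building for `target` on `host` would not be native.

    If `host` is omitted, the architecture of the running machine is used.
    """
    if host is None:
        host = platform.machine()
    return normalize_arch(target) != normalize_arch(host)


if __name__ == "__main__":
    for target in ("amd64", "arm64"):
        kind = "emulated" if needs_emulation(target) else "native"
        print(f"{target}: {kind} on this host")
```

If this reports "emulated" for your arm64 target on the CI runner, the slow build is the qemu path rather than arm64 itself.
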
You can do a self hosted runner on arm though, even on a Raspberry Pi (if inclined), so not sure it's prohibitively expensive to a startup, could even afford a spot graviton :D

I've got to say though, I've not experienced that delta in build times, even emulated on my machine.

> GitHub actions can build a container in a few minutes in x64 or 35 minutes in arm64

What type of container, and on what runner? That has not been my experience at all, a cross-compiling buildx build with Python and a bunch of libraries takes only slightly longer for arm64 than it did for x86.

My favorite way to watch this slow down is to introduce some node workloads into the build workflow.
First graviton is not magic. We switched our main service, which is a nodejs monolith, and did not get any cost improvement (we had to add more instances to handle the same workload, which ended up being equivalent cost wise). There are certainly use cases when it's better, but it doesn't seem to be the only and obvious choice for all use cases.

Second our laptops and our CI are amd64 machines, and being able to run the same docker images in prod and locally is nice, and not having to build the image with qemu on the CI is also good.

I don't mind cloud-ARM, but there definitely are good reasons not to use it (which of course don't apply to everyone)
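
The "we had to add more instances" outcome above is easy to sanity-check with arithmetic before migrating. A rough sketch follows; every number in it is a made-up placeholder (not a real AWS price or a measured throughput), and the function names are hypothetical. The point is only that a lower hourly price helps when price per unit of throughput drops, not price per instance.

```python
import math


def instances_needed(total_load: float, per_instance_capacity: float) -> int:
    """Smallest whole number of instances able to serve total_load."""
    return math.ceil(total_load / per_instance_capacity)


def fleet_monthly_cost(
    total_load: float,
    per_instance_capacity: float,
    hourly_price: float,
    hours_per_month: float = 730.0,
) -> float:
    """Monthly cost of a fleet sized to serve total_load."""
    count = instances_needed(total_load, per_instance_capacity)
    return count * hourly_price * hours_per_month


if __name__ == "__main__":
    load = 8000.0  # hypothetical peak requests per second

    # Hypothetical x86 instance: 1000 req/s at 0.10 per hour.
    x86 = fleet_monthly_cost(load, per_instance_capacity=1000.0, hourly_price=0.10)

    # Hypothetical ARM instance: 20% cheaper, but this workload only
    # reaches 800 req/s on it, so more instances are required.
    arm = fleet_monthly_cost(load, per_instance_capacity=800.0, hourly_price=0.08)

    print(f"x86 fleet: {x86:.2f} per month")
    print(f"arm fleet: {arm:.2f} per month")
```

With these placeholder figures the cheaper instance type needs 10 instances instead of 8 and the two fleets cost the same, which matches the experience described above. Measure your own per-instance throughput before trusting any such estimate.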

If I started to build today I'd build and host my own servers, or go with servers from ionos. Cloud is very expensive.
I just worked on a massive "optimised" cloud migration like you've never seen. We moved from multiple DCs to AWS and the costs are approximately 8x what the pre-migration costs are. We were realistically expecting 2x, which gave us some regional agility and was expected, but the unconstrained growth and misunderstanding of the cost model was terrible. It's designed to be so convoluted that you can't possibly estimate costs until you get the first bill, at which point you are committed to a multi-month or multi-year project. On top of that, the assumption at the time of development is that the cost is someone else's, so the sprawl since the migration is dangerous, which means we can never leave now that we've embraced the PaaS options.

The whole proposition relies on the idea of a sunk cost being accepted.

So yes, back to servers please. IaaS should be the maximal offering that is accepted by a business from a risk perspective unless the tool or technology is disposable in a 6 month window. There is space there for gains. PaaS hell no.

Edit: worth mentioning that AWS support is somewhere near dire. We've had issues with multiple services and, despite being a VERY high roller with enterprise support, we can't get anything fixed in any reasonable time. It's just someone else's crap you're using and they aren't any better at it than you are, just adding lead time to any issues. In some cases I've had to actually call out completely bad implementations that break function guarantees provided by open source projects (I can't logically warn people away from services as it's pretty obvious who I am if I do). One rule I've developed is that if it's not a core project (S3, EC2, EBS, ALB etc.) then it's probably a commercial liability in some way. There are no people working on, or with any knowledge of, some major bits of AWS infra.

>We moved from multiple DCs to AWS and the costs are approximately 8x what the pre-migration costs are.

8x?? That's crazy, what were you doing wrong then?

Everything, all at once.

SMEs can't reliably manage that transition with any skillset and still deliver a product at the same time.

I’m very curious to hear more details!

Did you use reservations to reduce costs?

Was it a lift and shift with VM configs staying as-is? (I’ve seen a lot of empty 1TB “app” drives burning money in the cloud!)

You complain about PaaS services, but I can’t imagine 8 data centres worth of stuff being converted to PaaS in a hurry!

Cost savings are mostly consolidation, scaling down stuff we don't need (we have peak hours) and migrating stuff to kubernetes and packing it tight.
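
As a toy illustration of what "packing it tight" buys you, here is a first-fit-decreasing sketch that places CPU requests onto as few equally sized nodes as it can. This is a textbook bin-packing heuristic, not how the Kubernetes scheduler actually works, and the workload sizes and node capacity are invented for the example.

```python
from typing import List


def first_fit_decreasing(requests: List[int], node_capacity: int) -> List[List[int]]:
    """Pack CPU requests (in millicores) onto nodes of equal capacity.

    Requests are placed largest first, each onto the first node that
    still has room; a new node is opened only when none fits.
    """
    nodes: List[List[int]] = []  # requests placed on each node
    free: List[int] = []         # remaining capacity of each node

    for req in sorted(requests, reverse=True):
        if req > node_capacity:
            raise ValueError(f"request {req} exceeds node capacity {node_capacity}")
        for i, remaining in enumerate(free):
            if req <= remaining:
                nodes[i].append(req)
                free[i] -= req
                break
        else:
            # No existing node had room, so open a new one.
            nodes.append([req])
            free.append(node_capacity - req)

    return nodes


if __name__ == "__main__":
    # Hypothetical workloads, one VM each before consolidation.
    workloads = [500, 1500, 1000, 2000, 250, 750]
    packed = first_fit_decreasing(workloads, node_capacity=2000)
    print(f"{len(workloads)} workloads fit on {len(packed)} nodes")
    for node in packed:
        print(node, "used:", sum(node))
```

In this invented case six workloads that each had their own VM fit onto three nodes; real savings depend on memory, peak-hour headroom and everything else the heuristic ignores.
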
Rent servers: yes. Host your own: maybe. You can run a whole-ass company on 2 $70/mo servers from Hetzner (and some B2 for durable storage) while you figure out whether you have a market or not.

Like there's just no point in coloing when you're small because either all non-server bits will cost you for no reason or you're using something managed which is just cloud but more annoying.

Switching to Graviton isn't an automatic cost saving. Everything is optimized for x86. It may be cheaper, but significantly slower. We've been trying to migrate for the last year, both for the cost saving and because we switched to ARM-based laptops.
This. We have three people entirely dedicated to reducing costs.

As for avoiding ARM, we do only x86-64 because corporate security policy demands that we have Windows laptops so that some box ticking overlord can fill out a security policy compliance form. That means we're stuck limping along with docker and WSL2. Every single engineer in the org has an arm64 machine at home already and wants a proper computer at work, which can ironically work in the same policy framework if anyone gave enough of a shit to deal with it.

So that's why we don't use Graviton; corporate security policies. Our customers will just have to eat the price hikes.

Builds in production shouldn't be built using developer laptops. I think you're approaching this wrong. You can build and test on x86_64 laptops all day if you want and still easily deploy to arm64 servers.
We don't do production builds on laptops.

But it's important that all builds are 100% reproducible on all build targets and that includes non docker ones. That is much more difficult if you have to cross compile stuff. We can barely manage one architecture.

You can cross-compile to ARM using Docker on Windows.

See: https://www.docker.com/blog/multi-arch-images/

Agree, I've gone to Graviton instances by default for RDS and ElastiCache (I run and own a DevOps consulting company). The big problem that I continue to deal with is native arm64 Docker containers (if you're a cool kid running containers / Kubernetes). For example, the very popular Bitnami charts don't support arm builds even though the community has been screaming for support.
If I started to build today, I'd definitely go for Hetzner Cloud. There is zero possibility that I get surprised by a large bill.
Too bad Hetzner doesn't have cheaper Arm servers available. Their Ampere pricing is not really convincing.
Who cares? Their x86 servers are cheap enough to not care about ARM.
I feel it's currently in beta, I've tried it and apparently I can't create more than a few instances because my account is "too new", without a clear way to remove that limit so you're right, can't have a large bill if you can't even create 10 instances.
Did you try writing them and asking them to increase the limit?

No cloud provider will give you the option to create as many instances as there are available ones, they all have limits from the get-go. Usually you have to write them/fill out some form if you want to go above the standard limit, Hetzner Cloud as well.

I haven't, mainly because of this warning on their Limits page:

> Your account is too new to request a limit increase. Please note that we generally do not answer questions regarding limit increase on the telephone.

It's very easy - you just write to them and ask to increase the limit to whatever you need. You need to do so in writing, not on the phone.
Let's go even further - "Cloud cost is underrated"
> In the industry there's a strong avoidance of ARM compute instances for no good reason.

Not no reason; it adds work and risks incompatibility. Now, that work might be relatively small, and most software these days is compatible with aarch64, but compared to amd64 (which is the de-facto standard, already supported by everything, the default without needing to set anything up) it's still something, and businesses are risk-averse.

We are building everything for arm and I'd be surprised if other large companies aren't optimising for it.
I agree. Until it becomes an issue, where everyone runs screaming like chickens, literally nobody gives a shit.
Because ARM perf was far, far from being on par with Intel / AMD; also, you need to be able to compile on that arch.