by l-p 1295 days ago
> We never thought our startup would be threatened by the unreliability of a company like Microsoft

You're new to Azure I guess.

I'm glad the outage I had yesterday was only the third major one this year, though the one in August made me lose days of traffic, months of back and forth with their support, and a good chunk of my sanity and patience in the face of blatant documented lies and general incompetence.

One consumer-grade fiber link is enough to serve my company's traffic, and with two months of what we pay MS for their barely working cloud I could buy enough hardware to host our product for a year or two of sustained growth.


It's worth pointing out that every cloud is the same when it comes to capacity / capacity risk. They all apply a lot of time and effort to figuring out the optimal amount of capacity to order based on track record of both customer demand and supply chain satisfaction.

Too much capacity is money spent getting no return, up front capex, ongoing opex, physical space in facilities etc.

At cloud scale (averaged out over all the customers) demand tends to follow pretty stable and predictable patterns, and the customers that actually tend to put capacity at risk (large ones) have contracts under which they give the providers plenty of heads-up.

What has been very problematic over the past few years has been the supply chains. Intel's issues for a few years in getting CPUs out really hurt the supply chains. All of the major providers struggled through it, and the market is still somewhat unpredictable. The supply chain woes that have been wreaking havoc with everything from the car industry to the domestic white goods industry are having similar impacts on the server industry.

The level of unreliability in the supply chain is making it very difficult for the capacity management folks to do their job. It's not even that predictable which supply chain is going to be affected. Some of them are running far smoother and faster and capacity lands far faster than you'd expect, while others are completely messed up, then next month it's all flipped around. They're being paranoid, assuming the worst and still not getting it right.

This is an area where buying physical hardware directly doesn't provide any particular advantages. Their supply chains are just as messed up.

The best you can do is to be as hardware agnostic as is technically possible, so you can use whatever is available... which sucks.
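A minimal sketch of what "hardware agnostic" can look like in provisioning code: walk an ordered list of acceptable instance types and zones, and take the first one that has capacity. Everything here is hypothetical. `launch` stands in for whatever provider call you actually use, `InsufficientCapacity` for its capacity error, and the type and zone names are placeholders rather than real SKUs.

```python
from typing import Callable, Iterable, Optional, Tuple


class InsufficientCapacity(Exception):
    """Hypothetical error raised when a provider has no capacity for a request."""


def launch_first_available(
    candidates: Iterable[Tuple[str, str]],
    launch: Callable[[str, str], str],
) -> Optional[Tuple[str, str, str]]:
    """Try each (instance_type, zone) pair in order of preference.

    Returns (instance_type, zone, instance_id) for the first success,
    or None if every candidate is out of capacity.
    """
    for instance_type, zone in candidates:
        try:
            instance_id = launch(instance_type, zone)
        except InsufficientCapacity:
            continue  # no capacity here, fall through to the next option
        return instance_type, zone, instance_id
    return None


def fake_launch(instance_type: str, zone: str) -> str:
    """Stand-in provider call: only 'medium-b' in 'zone-2' has capacity."""
    if (instance_type, zone) != ("medium-b", "zone-2"):
        raise InsufficientCapacity(f"{instance_type} unavailable in {zone}")
    return f"i-{instance_type}-{zone}"


# Ordered by preference; the workload must tolerate any entry on the list.
preferred = [("large-a", "zone-1"), ("large-a", "zone-2"), ("medium-b", "zone-2")]
result = launch_first_available(preferred, fake_launch)
```

The catch is the one the comment already names: the workload has to genuinely run on every entry in the list, which is where the real cost of being hardware agnostic sits.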

In my experience there are differences between clouds so while all have the same basic problem in practice some may be better than others. I've never had issues getting GPUs on AWS but GCP constantly has issues with GPU/TPU capacity.
Is this region dependent? In us-east I can’t get them to approve a quota for GPU instance families (G,P) for anything more than 4 CPUs. At one point they rejected my request citing “unprecedented demand”. Of course this is small time, just my personal account.

It is true I can get an instance most of the time, but not if I need >16GiB GPU memory.

We've been having the same problem getting GPU instances on us-east. Multiple week-long delays to escalate and talk to yet the next person up who can make a decision. It's a mess.
There are probably different occurrence rates. We had to modify how our test suite provisions instances, since we used to regularly run into instance availability constraints on EC2 during the holidays.
I’ve occasionally seen some of the internal AWS capacity management dashboards, and they can frequently be operating very close to 100% on some resource types.
I worked on a project about a year ago where we would have a colleague in a different time zone start instances with 4 GPUs, because they would almost always be unavailable during regular work hours for us-east.
It may be a risk borne by every cloud provider, but why does this only really happen to Microsoft among large providers?

As far as chip shortages, it probably helps that Amazon makes its own chips. Microsoft could do the same rather than running out of capacity and blaming chip shortages.

Microsoft had to know that at some point they were going to run out of capacity. They should have either done something about it or let customers know.

There are all sorts of examples of AWS failing to provide capacity too. Just do a search for "aws InsufficientInstanceCapacity" or similar. I remember Fortnite talking about capacity limits in relation to an incident, but I'm struggling to find the post-mortem I saw it in.

Even when Microsoft was being open about Azure having difficulty getting Intel chips, AWS, GCP etc. were in the same position and just not really talking about it. From my time in AWS there were some other times when some services with specialised hardware came really, really close to running out of capacity and had to scramble around with major internal "fire drills" against services to recoup capacity.

Most people won't run into these issues. The clouds all tend to be good at it, but they still happen.

There are also advantages of the economy of scale and brand recognition. The more customers you have the more the capacity trends smooth out, the easier it is to predict need, even if you're still stuck with uncertainty on the ordering side.

It’s certainly true I run into these things with AWS as well, but it’s generally limited to a specific instance type/availability zone combination. I’ve never had all instance types unavailable.

If anything, I’m surprised we can just spin up a few hundred instances out of nowhere and not run into capacity issues.

AWS has capacity issues you can generally mitigate. Azure however will just lock you out of a solution completely and tell you to switch regions as if that was some trivial thing.
They have a lot of technical debt. They have like 6 different clouds (at least 4 gov clouds alone) and SLA commitments to things like O365 that silo their infrastructure.

MS also makes all sorts of crazy deals and commitments, and I wouldn’t be surprised if being collocated with a strategic customer may lead to local shortages of resources.

AWS has at least 3 publicly-discussed 'clouds' (or partitions, as they're called at AWS). There may or may not be other partitions that cannot be discussed publicly.
There’s a pretty clean demarc between the AWS clouds. With Microsoft, because they have O365 and Azure AD dependencies sprinkled everywhere with varying features, it’s a real mess. So you can do government contract work with a device managed by Windows Autopilot & Intune in a commercial cloud, have email in a Gov Community Cloud, and deliver apps in a US Gov cloud, all with different identities etc.
> As far as chip shortages, it probably helps that Amazon makes its own chips.

IDK what chips you are talking about, all x86 (which I assume is most of their compute) is Intel or AMD. If they make their own they are only making the ARM instances.

AWS has three processors: Graviton, Inferentia, and Trainium. They're made in-house.

https://aws.amazon.com/silicon-innovation/

And none of the above are x86. Even if they're making their own silicon, it is for specialized use (ML) and not general server provisioning.
Amazon's own chips are ARM. ARM requires somewhat specialized builds of software that are likely different than development instances, CI/CD, and/or local dev machines. It's not insurmountable but does certainly complicate usage.
Your local dev machines might be Macs though, in which case it might be easier for you to go with ARM servers than x86.
They might be. My local dev machine is a Mac. I've found Intel or Intel+ARM container images; never an ARM-only one. Again, not insurmountable but certainly more resistance than the straight Intel route.
> This is an area where buying physical hardware directly doesn't provide any particular advantages. Their supply chains are just as messed up.

Yup. And a few of the OEMs have stopped talking about supply chain integrity. Many folks have observed more memory and power supply problems since the pandemic.

All cloud providers are NOT equal here. Amazon over-provisions and sells the excess capacity as spot instances.
So does Google, so does Azure etc. etc. https://cloud.google.com/spot-vms, https://azure.microsoft.com/en-us/products/virtual-machines/...

Spot instances exist just to try to turn over-provisioning into something that isn't a complete loss. You're at least making some money from your mistake.

edit: You should consider "spot instances" in general to be a failure as far as a cloud provider is concerned. It means you've got your guesses wrong. You always want a buffer zone, but not that much of a buffer zone. The biggest single cost for cloud providers is the per-rack OpEx, the cost of powering, cooling etc.

Cloud providers aren't guessing at demand to plan capacity, they're literally building new data centers and then wheeling new racks into them as fast as they physically can (short-term decisions are more likely made at the other end, e.g. when to retire old systems, not add new ones). AWS was born out of the fact that Amazon's own compute needs are inherently variable so to meet peak demand they had to "over-provision" compared to average demand--this in turn meant they had a lot of excess compute power most of the time. At the point when Amazon still was a dominant consumer of AWS, spot instances were actually a deliberate convenience to Amazon, since it meant AWS could monetize resources while still ensuring Amazon could claim them instantly when needed (later they added a two minute warning, but early on they could literally disappear at any moment, and regularly did).
You're talking to someone who has spent the last decade working for major cloud providers, including AWS, on infrastructure and services sides of things, including work around data feeds for the capacity management teams. I have more than a passing familiarity with the way things actually work at a cloud.

They are constantly guessing at cloud capacity. Short, medium, and long term models with forecasting galore, all under constant recalculation based on customer actions (they literally take live feeds of creation/termination actions), and yes, they also take into account hardware failure and repair rates. Consolidating racks of equipment is a pain in the neck and tends to be avoided, unless you can safely live migrate away all instances.

They all build up various models, using all sorts of forecasting techniques. The longer range forecasts are involved in data center provisioning, along with other business analysis, market research, legal analysis etc. that helps define where future regions should be.

It's still a guess. They can't tell what the actual demand will be, and they can't tell what is going to happen with the supply chain (supply chain issues are the biggest nightmare for capacity planning teams). Sometimes they get it wrong.

The capacity management teams spend a lot of time and expertise to keep the company just sufficiently ahead of demand. It's a crucial part of keeping costs under control.
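As a toy illustration of the "it's still a guess" point, and not a description of any provider's actual models: fit a straight line to recent demand, project it out by the hardware lead time, and add headroom. The function names, the 15% default headroom, and the sample figures are all assumptions made up for this sketch.

```python
from typing import Sequence


def linear_forecast(history: Sequence[float], steps_ahead: int) -> float:
    """Project demand `steps_ahead` periods past the last observation
    using an ordinary least-squares line through the history."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + steps_ahead)


def capacity_to_order(
    history: Sequence[float],
    lead_time: int,
    installed: float,
    headroom: float = 0.15,
) -> float:
    """Units to order now so that installed capacity covers the forecast
    at delivery time plus a safety buffer. Never returns a negative order."""
    target = linear_forecast(history, lead_time) * (1 + headroom)
    return max(0.0, target - installed)


# Hypothetical monthly demand in racks, with a two-month delivery lead time.
demand = [100, 110, 120, 130]
order = capacity_to_order(demand, lead_time=2, installed=140, headroom=0.10)
```

If the delivery slips, the real lead time is longer than the one the order was sized for, and the headroom gets eaten before the racks land. That is the supply-chain failure mode described above.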

It's logistics, no more and no less. Logistics has been a thing forever (satisfy a resource requirement). My old man (is not a dustman) but he was Commander Supply for quite a lot of people. At one job, he and his staff would worry about things like Austrian plain chocolate covered mint centred frogs (I'm not joking) to Gurkha rice and not much else (some very concentrated protein etc) water-proofed combat rations. This was in Cyprus in the '80s. Logistics on the green line in Cyprus is probably still as mad now due to the number of countries in the UN.

Anyway, capacity planning is very well understood in general but of course the devil is in the details.

At the moment the IT supply chain is pretty spotty and that affects my little IT firm up to the big boys.

When you buy Cisco + HPE + Dell or whatevs, you go to your reseller (me). I go to my distributor and they suck hardware out of Dell etc and take their cut and I install the gear and take my cut. Sometimes a disty thinks they can do reseller too. The thinking is that they can roll up two lots of margin and shave a bit. That's fine if you can actually do logistics and the "teeth arm" job too.

Clouds think they can go even further and sometimes they can and sometimes not. Now we have a sodding complicated resource on offer with a supply chain that is a bit random.

The whole hyperscale cloud premise is based on infinite availability of raw resources and that is complete bollocks. You can't hyperscale if you can't source stuff indefinitely.

Those Austrian mint filled choccy frogs became a thing for a while. I have no idea of the exact numbers but presumably Austria supplied quite a lot of them for the UN forces and families in Cyprus in the '80s - they became a bargaining chip for a while. They came in a cardboard package with a lid coloured light blue with outlines of frogs and I think the main box was dark brown or black.

So does Azure.
Never happened to me in AWS.

Wasn't the whole point of "the cloud" that these things shouldn't happen?

Azure has had some of the biggest outages, like when they went down on Feb 29th for the whole day.

https://azure.microsoft.com/en-us/blog/summary-of-windows-az...

It seems like in nearly 3 out of every 4 years the whole internet is unusable on February 29... why pick on microsoft?
That was 10 years ago. Has there been something similar recently?
The last one I remember is this one from August this year: https://redmondmag.com/articles/2022/08/30/microsoft-blames-... It was not a complete outage but these DNS issues caused a lot of pain.
Having worked for a company that's a very large customer of AWS's, it's not much better.

I've worked with both Azure and AWS professionally and both have had their fair share of "too many outages" or capacity issues. At this point, you basically must go multi-region to ensure capacity and even better if you can go multi-cloud.

We have actually used Azure for ~2 years now. It worked reasonably well most of the time, even though we also had a few issues. But our current issue, plus reading your and other comments, will probably result in us looking for a new home.
> One consumer-grade fiber link is enough to serve my company's traffic and with two months of what we pay MS for their barely working cloud I could buy enough hardware to host our product for a year or two of sustained growth.

I don't believe that is even remotely correct.

It isn't the pricing you should be worried about but the staffing, redundancy, and 24/7 operations staff.

I'm dealing with AWS and on-prem. On-prem spent some $5M to build out a whole new setup, took literal multiple months of racking, planning, designing, setting up, etc.

It's not even entirely in use because we hit supply chain issues for 100 Gbit switches and they won't be coming until at least April of 2023 (after many months of delays upon delays already).

Depending on your scale, things are really not that complicated. If you can run your company from a single machine, having two for redundancy, and two internet links for redundancy, will likely go a loooooooooong way until something bad happens...
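A back-of-the-envelope sketch of why two of everything goes a long way. It assumes failures are independent, which is optimistic when both machines share a building, a power feed, or an admin. The 99% single-component figure is an assumption picked for illustration, not a measured number.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant components, any one of which can
    carry the load, assuming their failures are independent."""
    if not 0.0 <= single <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one copy")
    return 1.0 - (1.0 - single) ** copies


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


one_box = combined_availability(0.99, copies=1)   # a single machine or link
two_boxes = combined_availability(0.99, copies=2)  # a redundant pair

downtime_single = expected_downtime_hours(one_box)
downtime_pair = expected_downtime_hours(two_boxes)
```

Under these assumptions the pair cuts expected downtime by a factor of about a hundred. Correlated failures (the shared fiber duct, the same bad config push) are what the model leaves out.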
Out of curiosity (from someone inexperienced with Azure), is it a skill/ability chasm between MS engineering and outsourced support?

TAMs tend to be a band-aid and an organizational sign that support-as-normal sucks and isn't sufficient to get the job done (i.e., fix everything that breaks and isn't self-serve).

Microsoft support is really awful. Basically, if you need it regularly, you just pay for resident engineers who can bypass the wall between the product groups and you. I’ve had nothing but great experiences with those guys.

Otherwise, especially if there’s a broader problem, they play lots of games with SLAs, etc.

YES! We tried a big project in the cloud (many many many high end VMs), and Azure was SO unreliable. From BGP config fuck-ups to obscure bugs in their stack.

Their support was also amazing in the beginning... but after they've hooked you... you're just a ticket in their system. It takes weeks to fix something you could fix in minutes on-prem, or that their black belt would have fixed in a very short amount of time at the beginning of the relationship.

Cloud isn't that magical unicorn!

Yes, and what is your contingency plan for said fiber going dark?
I have DB connection issues at least a few times a week. Annoying.
New Microsoft customer at all.
The common argument of "our own hardware would be more profitable in X years" is typically countered with "but you need to pay engineers to maintain it, which adds to the cost".

Another advantage of not having to own the hardware is that it's easier to scale and to get started with new types of services (e.g., data warehouse solutions, serverless compute, new DB types, ...).

I'm not trying to advocate for or against cloud solutions here, but just pointing out that the decision making has more factors apart from "hardware cost".

Depends on how stable your needs are, but sometimes it's cheaper even when you consider total cost, and not just for big deployments.

In the past two or three years, we have probably moved more services off the cloud than the other way around. That said, one reason is that most new services are built in the cloud, so there are fewer services off the cloud than on it.

Cloud is best when you are starting out, when you don't know what you need, when you need high velocity in adding new stuff, or when you have very bursty demand for traffic, CPU, etc. Or if you are just a small developer-only team.

But if you have applications that are relatively stable and mostly feature complete, and you don't expect much sudden growth, it's useful to run the numbers on whether cloud is still something you want or need.
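One rough way to "run the numbers" is a break-even calculation. This is a sketch with made-up figures, and it only works if the on-prem monthly cost honestly includes the things raised elsewhere in the thread: staff, colocation, power, redundant links, spares.

```python
import math
from typing import Optional


def breakeven_months(
    hardware_capex: float,
    onprem_monthly: float,
    cloud_monthly: float,
) -> Optional[int]:
    """First whole month at which cumulative on-prem cost (capex plus monthly
    running cost) is no higher than cumulative cloud cost.

    Returns None when on-prem costs at least as much per month as cloud,
    in which case the up-front hardware spend is never recovered.
    """
    monthly_saving = cloud_monthly - onprem_monthly
    if monthly_saving <= 0:
        return None
    return math.ceil(hardware_capex / monthly_saving)


# Hypothetical figures: 120k of hardware, 15k/month to run it, 25k/month cloud bill.
months = breakeven_months(120_000, onprem_monthly=15_000, cloud_monthly=25_000)
```

A break-even that lands well inside the hardware's useful life is an argument for owning it. One that lands near or past it, or a `None`, is an argument for staying put.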