

by l-p 1342 days ago

> We never thought our startup would be threatened by the unreliability of a company like Microsoft

You're new to Azure I guess.

I'm glad the outage I had yesterday was only the third major one this year, though the one in August cost me days of traffic, months of back and forth with their support, and a good chunk of my sanity and patience in the face of blatant documented lies and general incompetence.

One consumer-grade fiber link is enough to serve my company's traffic, and with two months of what we pay MS for their barely working cloud I could buy enough hardware to host our product for a year or two of sustained growth.

11 comments

Twirrim 1342 days ago

It's worth pointing out that every cloud is the same when it comes to capacity / capacity risk. They all apply a lot of time and effort to figuring out the optimal amount of capacity to order based on track record of both customer demand and supply chain satisfaction.

Too much capacity is money spent getting no return, up front capex, ongoing opex, physical space in facilities etc.

On cloud scales (averaged out over all the customers) the demand tends to follow pretty stable and predictable patterns, and the ones that actually tend to put capacity at risk (large customers) have contracts where they'll give plenty of heads-up to the providers.

What has been very problematic over the past few years has been the supply chains. Intel's issues for a few years in getting CPUs out really hurt the supply chains. All of the major providers struggled through it, and the market is still somewhat unpredictable. The supply chain woes that have been wreaking havoc with everything from the car industry to the domestic white goods industry are having similar impacts on the server industry.

The level of unreliability in the supply chain is making it very difficult for the capacity management folks to do their job. It's not even that predictable which supply chain is going to be affected. Some of them are running far smoother and faster and capacity lands far faster than you'd expect, while others are completely messed up, then next month it's all flipped around. They're being paranoid, assuming the worst and still not getting it right.

This is an area where buying physical hardware directly doesn't provide any particular advantages. Their supply chains are just as messed up.

The best thing to try to do is do your best to be as hardware agnostic as is technically possible, so you can use whatever is available... which sucks.
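A minimal sketch of what "use whatever is available" could look like in provisioning code, assuming the caller supplies a preference-ordered list of (instance type, zone) pairs. `CapacityError` and the `launch` callable are hypothetical stand-ins for whatever a given provider's SDK actually raises and exposes; nothing here is a real cloud API.

```python
from typing import Callable, Iterable, Optional, Tuple


class CapacityError(Exception):
    """Hypothetical error: the provider has no capacity for this request."""


def launch_first_available(
    candidates: Iterable[Tuple[str, str]],
    launch: Callable[[str, str], str],
) -> Optional[str]:
    """Try (instance_type, zone) pairs in preference order.

    Returns the ID from the first successful launch, or None if every
    candidate was rejected for lack of capacity.
    """
    for instance_type, zone in candidates:
        try:
            return launch(instance_type, zone)
        except CapacityError:
            # This combination is full; fall through to the next one.
            continue
    return None
```

Only capacity errors are swallowed; any other exception from `launch` still propagates, so quota or permission problems are not silently retried on different hardware.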

marcinzm 1342 days ago

In my experience there are differences between clouds so while all have the same basic problem in practice some may be better than others. I've never had issues getting GPUs on AWS but GCP constantly has issues with GPU/TPU capacity.

indoorskier 1342 days ago

Is this region dependent? In us-east I can’t get them to approve a quota for GPU instance families (G,P) for anything more than 4 CPUs. At one point they rejected my request citing “unprecedented demand”. Of course this is small time, just my personal account.

It is true I can get an instance most of the time, but not if I need >16GiB GPU memory.

bane 1342 days ago

We've been having the same problem getting GPU instances on us-east. Multiple week-long delays to escalate and talk to yet the next person up who can make a decision. It's a mess.

ajmurmann 1342 days ago

There probably are different occurrence rates. We had to modify how our test suite provisions instances, since we used to regularly run into instance availability constraints on EC2 during the holidays.

AmericanChopper 1342 days ago

I’ve occasionally seen some of the internal AWS capacity management dashboards, and they can frequently be operating very close to 100% on some resource types.

jerpint 1341 days ago

I worked on a project about a year ago where we would have a colleague in a different time zone start instances with 4 GPUs, because they would almost always be unavailable during regular work hours in us-east.

whoknew1122 1342 days ago

It may be a risk borne by every cloud provider, but why does this only really happen to Microsoft among large providers?

As far as chip shortages, it probably helps that Amazon makes its own chips. Microsoft could do the same rather than running out of capacity and blaming chip shortages.

Microsoft had to know that at some point they were going to run out of capacity. They should've either done something about it or let customers know.

Twirrim 1342 days ago

There's all sorts of examples of AWS failing to be able to provide capacity too. Just do a search for "aws InsufficientInstanceCapacity" or similar. I remember Fortnite talking about capacity limits in relation to an incident, but I'm struggling to find the post-mortem I saw it in.

Even when Microsoft was being open about Azure having difficulty getting Intel chips, AWS, GCP etc. were in the same position and just not really talking about it. From my time in AWS there were some other times when some services with specialised hardware came really, really close to running out of capacity and had to scramble around with major internal "fire drills" against services to recoup capacity.

Most people won't run into these issues; the clouds all tend to be good at it, but they still happen.

There are also advantages of the economy of scale and brand recognition. The more customers you have the more the capacity trends smooth out, the easier it is to predict need, even if you're still stuck with uncertainty on the ordering side.

Aeolun 1342 days ago

It’s certainly true I run into these things with AWS as well, but it’s generally limited to a specific instance type/availability zone combination. I’ve never had all instance types unavailable.

If anything, I’m surprised we can just spin up a few hundred instances out of nowhere and not run into capacity issues.

llama052 1339 days ago

AWS has capacity issues you can generally mitigate. Azure however will just lock you out of a solution completely and tell you to switch regions as if that was some trivial thing.

Spooky23 1342 days ago

They have a lot of technical debt. They have like 6 different clouds (at least 4 gov clouds alone) and SLA commitments to things like O365 that silo their infrastructure.

MS also makes all sorts of crazy deals and commitments, and I wouldn’t be surprised if being collocated with a strategic customer may lead to local shortages of resources.

whoknew1122 1342 days ago

AWS has at least 3 publicly-discussed 'clouds' (or partitions, as they're called at AWS). There may or may not be other partitions that cannot be discussed publicly.

Spooky23 1341 days ago

There’s a pretty clean demarc between the AWS clouds. With Microsoft, because they have O365 and Azure AD dependencies sprinkled everywhere with varying features, it’s a real mess. So you can do a government contract with a device managed by Windows Autopilot & Intune in a commercial cloud, have email in a Gov Community Cloud, and deliver apps in a US Gov cloud, all with different identities etc.

hitpointdrew 1342 days ago

> As far as chip shortages, it probably helps that Amazon makes its own chips.

IDK what chips you are talking about, all x86 (which I assume is most of their compute) is Intel or AMD. If they make their own they are only making the ARM instances.

whoknew1122 1342 days ago

AWS has three processors: Graviton, Inferentia, and Trainium. They're made in-house.

https://aws.amazon.com/silicon-innovation/

boarush 1340 days ago

And none of the above are x86. Even if they're making their own silicon, it is for specialized use (ML) and not general server provisioning.

cosmotic 1342 days ago

Amazon's own chips are ARM. ARM requires somewhat specialized builds of software that are likely different than development instances, CI/CD, and/or local dev machines. It's not insurmountable but does certainly complicate usage.

philwelch 1342 days ago

Your local dev machines might be Macs though, in which case it might be easier for you to go with ARM servers than x86.

cosmotic 1341 days ago

They might be. My local dev machine is a Mac. I've found Intel or Intel+ARM container images; never an ARM-only one. Again, not insurmountable, but certainly more resistance than the straight Intel route.

Spooky23 1342 days ago

> This is an area where buying physical hardware directly doesn't provide any particular advantages. Their supply chains are just as messed up.

Yup. And a few of the OEMs have stopped talking about supply chain integrity. Many folks have observed more memory and power supply problems since the pandemic.

more_corn 1342 days ago

All cloud providers are NOT equal here. Amazon over-provisions and sells the excess capacity as spot instances.

Twirrim 1342 days ago

So does google, so does azure etc. etc. https://cloud.google.com/spot-vms, https://azure.microsoft.com/en-us/products/virtual-machines/...

Spot instances exist just to try to turn over-provisioning into something other than a complete loss. You're at least making some money from your mistake.

edit: You should consider "spot instances" in general to be a failure as far as a cloud provider is concerned. It means you've got your guesses wrong. You always want a buffer zone, but not that much of a buffer zone. The biggest single cost for cloud providers is the per-rack OpEx, the cost of powering, cooling etc.

femto113 1342 days ago

Cloud providers aren't guessing at demand to plan capacity, they're literally building new data centers and then wheeling new racks into them as fast as they physically can (short-term decisions are more likely made at the other end, e.g. when to retire old systems, not add new ones). AWS was born out of the fact that Amazon's own compute needs are inherently variable so to meet peak demand they had to "over-provision" compared to average demand--this in turn meant they had a lot of excess compute power most of the time. At the point when Amazon still was a dominant consumer of AWS, spot instances were actually a deliberate convenience to Amazon, since it meant AWS could monetize resources while still ensuring Amazon could claim them instantly when needed (later they added a two minute warning, but early on they could literally disappear at any moment, and regularly did).

Twirrim 1342 days ago

You're talking to someone who has spent the last decade working for major cloud providers, including AWS, on infrastructure and services sides of things, including work around data feeds for the capacity management teams. I have more than a passing familiarity with the way things actually work at a cloud.

They are constantly guessing at cloud capacity. Short, medium, and long term models with forecasting galore, all under constant recalculation based on customer actions (they literally take live feeds of creation/termination actions), and yes, they also take into account hardware failure and repair rates. Consolidating racks of equipment is a pain in the neck and tends to be avoided, unless you can safely live migrate away all instances.

They all build up various models, using all sorts of forecasting techniques. The longer range forecasts are involved in data center provisioning, along with other business analysis, market research, legal analysis etc. that helps define where future regions should be.

It's still a guess. They can't tell what the actual demand will be, and they can't tell what is going to happen with the supply chain (supply chain issues are the biggest nightmare for capacity planning teams). Sometimes they get it wrong.

The capacity management teams spend a lot of time and expertise to keep the company just sufficiently ahead of demand. It's a crucial part of keeping costs under control.

gerdesj 1342 days ago

It's logistics no more and no less. Logistics has been a thing for ever (satisfy a resource requirement). My old man (is not a dustman) but he was Commander Supply for quite a lot of people. At one job, he and his staff would worry about things like Austrian plain chocolate covered mint centred frogs (I'm not joking) to Gurkha rice and not much else (some very concentrated protein etc) water-proofed combat rations. This was in Cyprus in the '80s. Logistics on the green line in Cyprus is probably still as mad now due to the number of countries in the UN.

Anyway, capacity planning is very well understood in general but of course the devil is in the details.

At the moment the IT supply chain is pretty spotty and that affects my little IT firm up to the big boys.

When you buy Cisco + HPE + Dell or whatevs, you go to your reseller (me). I go to my distributor and they suck hardware out of Dell etc and take their cut and I install the gear and take my cut. Sometimes a disty thinks they can do reseller too. The thinking is that they can roll up two lots of margin and shave a bit. That's fine if you can actually do logistics and the "teeth arm" job too.

Clouds think they can go even further and sometimes they can and sometimes not. Now we have a sodding complicated resource on offer with a supply chain that is a bit random.

The whole hyperscale cloud premise is based on infinite availability of raw resources and that is complete bollocks. You can't hyperscale if you can't source stuff indefinitely.

Those Austrian mint filled choccy frogs became a thing for a while. I have no idea of the exact numbers but presumably Austria supplied quite a lot of them for the UN forces and families in Cyprus in the '80s - they became a bargaining chip for a while. They came in a cardboard package with a lid coloured light blue with outlines of frogs and I think the main box was dark brown or black.

jiggawatts 1342 days ago

So does Azure.

moralestapia 1341 days ago

Never happened to me in AWS.

Wasn't the whole point of "the cloud" that these things shouldn't happen?

adrr 1342 days ago

Azure has had some of the biggest outages, like when they went down on Feb 29th for the whole day.

https://azure.microsoft.com/en-us/blog/summary-of-windows-az...

jepler 1342 days ago

It seems like in nearly 3 out of every 4 years the whole internet is unusable on February 29... why pick on microsoft?

Godel_unicode 1342 days ago

That was 10 years ago; has there been something similar recently?

flippingbits 1342 days ago

The last one I remember is this one from August this year: https://redmondmag.com/articles/2022/08/30/microsoft-blames-... It was not a complete outage but these DNS issues caused a lot of pain.

rufius 1342 days ago

Having worked for a company that's a very large customer of AWS's, it's not much better.

I've worked with both Azure and AWS professionally and both have had their fair share of "too many outages" or capacity issues. At this point, you basically must go multi-region to ensure capacity and even better if you can go multi-cloud.

janober 1342 days ago

We've actually been using Azure for ~2 years now. It has worked reasonably well most of the time, even though we also had a few issues. But our current issue + reading your and other comments will probably result in us looking for a new home.

ckdarby 1342 days ago

> One consumer-grade fiber link is enough to serve my company's traffic and with two months of what we pay MS for their barely working cloud I could buy enough hardware to host our product for a year or two of sustained growth.

I don't believe that is even remotely correct.

It isn't the pricing you should be worried about but the staffing, redundancy, and 24/7 operations staff.

I'm dealing with AWS and on-prem. On-prem spent some $5M to build out a whole new setup, took literal multiple months of racking, planning, designing, setting up, etc.

It's not even entirely in use because we hit supply chain issues with 100 Gbit switches and they won't be coming until at least April of 2023 (after many months of delays upon delays already).

aprdm 1341 days ago

Depending on your scale, things are really not that complicated. If you can run your company from a single machine, having two for redundancy, and two internet links for redundancy, will likely go a loooooooooong way until something bad happens...
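A back-of-the-envelope sketch of why a second machine and a second link go a long way, assuming failures are independent (a strong assumption: shared power, a shared building, or a shared upstream carrier all break it). The function names and any availability figures fed into them are illustrative, not measurements.

```python
def parallel_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant components where one is enough.

    Assumes independent failures: the group is down only when every
    copy is down at the same time.
    """
    return 1.0 - (1.0 - single) ** copies


def serial_availability(*parts: float) -> float:
    """Availability of a chain in which every part must be up."""
    result = 1.0
    for part in parts:
        result *= part
    return result


def site_availability(machine: float, link: float, copies: int = 2) -> float:
    """A site needs at least one working machine AND one working link."""
    return serial_availability(
        parallel_availability(machine, copies),
        parallel_availability(link, copies),
    )
```

Under these assumptions, two components that are each up 99% of the time combine to roughly 99.99%, which is the sense in which doubling up goes a long way. It says nothing about correlated failures or about who gets paged at 3 a.m.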

ethbr0 1342 days ago

Out of curiosity (from someone inexperienced with Azure), is it a skill/ability chasm between MS engineering and outsourced support?

TAMs tend to be a bandaid organizational sign that support-as-normal sucks and isn't sufficient to get the job done (i.e., fix everything that breaks and isn't self-serve).

Spooky23 1342 days ago

Microsoft support is really awful. Basically, if you need it regularly, you just pay for resident engineers who can bypass the wall between the product groups and you. I’ve had nothing but great experiences with those guys.

Otherwise, especially if there’s a broader problem, they play lots of games with SLAs, etc.

aprdm 1341 days ago

YES! We tried a big project in the cloud (many many many high end VMs), and Azure was SO unreliable. From BGP config fuck-ups to obscure bugs in their stack.

Their support was also amazing in the beginning... but after they've hooked you... you're just a ticket in their system. It takes weeks to fix something you could fix in minutes on-prem, or that their black belt would get fixed in a very short amount of time at the beginning of the relationship.

Cloud isn't that magical unicorn!

SergeAx 1342 days ago

Yes, and what is your contingency plan for said fiber going dark?

roflyear 1342 days ago

I have DB connection issues at least a few times a week. Annoying.

marcosdumay 1342 days ago

New Microsoft customer at all.

Insanity 1342 days ago

The common argument of "our own hardware would be more profitable in X years" is typically countered with "but you need to pay engineers to maintain it, which adds to the cost".

Another advantage of not having to own the hardware is that it's easier to scale and to get started with new types of services (e.g., data warehouse solutions, serverless compute, new DB types, ...).

I'm not trying to advocate for or against cloud solutions here, but just pointing out that the decision making has more factors apart from "hardware cost".

unionpivo 1342 days ago

Depends on how stable your needs are, but sometimes it's cheaper even when you consider total cost, and not just for big deployments.

In the past two or three years, we probably moved more services off the cloud than the other way around. That said, one reason for that is that most new services are built in the cloud, so there are fewer services off the cloud than on it.

Cloud is best when you are starting out, when you don't know what you need, need high velocity of adding new stuff, or have very bursty demand for either traffic or CPU etc. Or if you are just a small developer-only team.

But if you have applications that are relatively stable, are mostly feature complete and you don't expect much sudden growth etc, it's useful to run the numbers if cloud is still something you want/need.
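"Running the numbers" can start as simply as a break-even calculation. The sketch below is a deliberately crude model with hypothetical parameter names: it assumes a one-off hardware purchase, flat monthly costs on both sides, and that the on-prem monthly figure already includes the staffing, links, power, and redundancy raised elsewhere in the thread.

```python
from typing import Optional


def breakeven_months(
    hardware_capex: float,
    onprem_monthly_cost: float,
    cloud_monthly_cost: float,
) -> Optional[float]:
    """Months until owned hardware pays for itself versus staying in the cloud.

    Returns None when on-prem is not cheaper per month, in which case the
    up-front purchase is never recovered under this model.
    """
    monthly_saving = cloud_monthly_cost - onprem_monthly_cost
    if monthly_saving <= 0:
        return None
    return hardware_capex / monthly_saving
```

If the result is longer than the expected useful life of the hardware, or than the period over which your needs are likely to stay stable, the cloud bill is probably still the better deal.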