
by arturhoo 3917 days ago

We did not have detailed enough monitoring for this dimension (membership size), and didn’t have enough capacity allocated to the metadata service to handle these much heavier requests.

As much as I admire and rely on AWS' scale to build architectures and fault tolerant applications, it can't be ignored that the marketing towards going "full cloud" doesn't take into account how hard it is to build resilient architectures in the cloud.

I see those disruption events as stop signs: when the cloud itself fails to scale, I rethink a few decisions we all make when surfing those trends.

http://yourdatafitsinram.com/ also comes to mind.

2 comments

LoSboccacc 3917 days ago

infrastructure is hard, and it gets exponentially harder with the number of nodes you need to scale to.

that said, even with those disruptions and whatnot happening on Amazon as a warning, I am not skilled enough, nor do I have enough time, to build a resilient non-cloud infrastructure.

I was looking to go with redundant vps at first, because amazon does have a high cost for us. however, just learning all the things that can go wrong in the very first part, the load balancer, and all the gritty details one has to consider for just this little component to support interruption-free failover, made me rethink the cost-benefit of going managed.

it is true that going cloud doesn't really remove outage risk completely and it will not be as resilient as an infrastructure built with skill and love by the best out there, but how many shops can actually roll with their own solution and get an equivalent level of availability?

scaling web nodes is within my capabilities, building a ha database is already quite above my skill but I may manage. testing database failover, making sure it works, making sure that it can actually recover from one node dying and that the application stays live meanwhile? that's way above what I can reasonably do and what my company can afford to pay maintenance for.


chillydawg 3917 days ago

How is it any harder to do in the cloud than on a rack in a warehouse? At least you don't have to muck about with cables and phoning power companies up.


antirez 3917 days ago

Just an example: during the issue even people serving 10 ops/sec, but very important 10 ops/sec, were affected by a huge overall load, most of which was not theirs. It's true that when you "go cloud" you don't have to manage your operations, but you are basically putting everything in the hands of other ops people, and what happens to you depends on a wider set of conditions.

So managing your stuff is hard, but you are in control and can do things in a way you believe is completely safe for you. Or you may at least run into the same events sometimes, but perhaps while paying a lot less for the same services. Or you can create a deployment with characteristics that are often impossible to make cost-effective in the cloud (a lot of RAM for each server is an example).

It's not stupid to use AWS services, but it's not stupid to manage your own operations either, whether on your own hardware or at least using just the bare metal and/or virtual machine services certain providers offer, while still being in part accountable, responsible, and in control of your system software deployment and operations.


chillydawg 3917 days ago

Yeah, but it's a very rare in-house team that can keep stuff up better than an AWS or a gcloud. I'd only consider it if I was doing really, really REALLY specialised stuff that absolutely could not be handed off to some mega-host.


toomuchtodo 3917 days ago

I used to do infrastructure on physical hardware, and we'd go years without an outage sometimes (generators in the datacenter, diesel fuel contracts, redundant fiber providers using BGP). Doing it in the cloud is harder, because you're at the mercy of the provider when things go south, and you have no transparency into why it went wrong except what they're willing to publish. Why did it happen? Will it happen again?

I mean, you can argue that the cloud is better. But how often are Heroku and AWS down? About the same as physical providers (I concede S3 is pretty solid though).


nickpsecurity 3917 days ago

You call up IBM. You ask for a mainframe solution for two sites. You get experts to set it up for you with your application and such. You don't worry about downtime again for at least 30 years.

You call up Bull, Fujitsu, or Unisys for the same thing.

You call up HP. You ask for a NonStop solution. You get same thing for at least 20 years.

You call up VMS Software. You ask for an OpenVMS cluster. You get same thing for at least 17 years.

Well-designed OSes, software, and hardware did cloud-style stuff for a long time before cloud existed, without the downtime. Cloud certainly brought price down and flexibility up. Yet these clouds haven't matched '70s-'80s technology in uptime, despite all the brains and money thrown at them. That's a fact.

So, shouldn't be used for anything mission critical where downtime costs lots of money.


inopinatus 3917 days ago

This is absolute cobblers. I worked in an IT team that had a pair of IBM mainframes that were fed and watered at crushing expense and for which even the tiniest software change required a colossal waterfall project.

One day, they failed. One went offline - for reasons never revealed, at least to me - and the secondary didn't come up. Radio silence, kaput. But an airline that housed mainframes in the same DCs had their booking system fail at exactly the same time (with national headlines to match).

The myth of mainframe uptime is exactly that. La-la-land for hardware & services salesmen.


nickpsecurity 3917 days ago

Appreciate the counterpoint. Could in fact be a myth or legend. Lots of money to justify spreading disinformation, too. Maybe an anonymous survey by a reputable organization is in order that tries to break down what issues people have and don't have along with specific metrics. Then compile that into a big picture.

Meanwhile, the companies I've worked at all had mainframes without the trouble people describe. Problems were virtually always down to the app developers or the pain of doing 21st-century stuff with '60s-'80s architecture or legacy code.


AnonNo15 3917 days ago

I've seen a NonStop solution fail for the completely mundane reason of insufficient disk space after a burst of transactions. One condition for those 30-year uptimes is also a 100% predictable environment.


nickpsecurity 3917 days ago

Never heard of that one. Funny stuff. Shows even top tier can be improved.

Edit to add: mainframes also run user- and server-type workloads. Some of those are predictable, some aren't. Bull's mainframes virtualize whole desktops. The mainframe as a whole, especially its important services, is usually still available despite issues with those. For instance, my company splits stuff between critical on mainframe or AS/400s plus non-critical on whatever is useful ("best-of-breed" they say...). The critical stuff is either on the IBM stuff or leverages it in a client-server setup. Those apps either always work or (rarely) they fail-safe in an obvious way that does no damage. Nobody I work with can remember those systems going down in the 10 years they have worked there. The other stuff regularly has issues across the board. The key difference is effective architecture and how it's implemented.


carterehsmith 3917 days ago

Caveat here. Sure, a mainframe/high end hardware & supervisor OS can run for 30 years... but the actual applications that users are facing... no, they cannot. You need to upgrade DB2 or IMS or whatever Java app you are using? There will be downtime.


nickpsecurity 3917 days ago

Depends on the design. You have to plan for that stuff ahead of time. I'm not going to claim that's easy. It's just very helpful and there's companies that specialize in helping with it. Most common method was decomposing the app while running it on a cluster so parts of the app or nodes can be taken down. There's strategies for mainframes, too, but my experience was clusters.

Basic strategy was putting something in front of them that can redirect to the new system upon a trigger. Let's assume it's functionality + tons of data. The new system first gets the data moved to it in batch form for efficiency reasons. Once it catches up enough, it starts syncing in a more online fashion until it gets to the point that it's syncing in real-time. All kinds of tests are performed on that system throughout this process. Eventually, a change-over happens that should be barely perceptible. The inability to do this is usually due to fragile architecture or tightly-coupled implementations, which are unfortunately all too common in enterprises.
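A toy sketch of that sequence (batch copy, catch-up sync, then change-over), just to make the steps concrete. Everything here is hypothetical: the `Router` class, the `batch_copy`, `catch_up` and `cut_over` helpers, and the use of plain dicts as stand-ins for the old and new systems. It is single-threaded and ignores deletes, so treat it as an illustration of the idea rather than how any real migration tool works.

```python
class Router:
    """Hypothetical front door that sends traffic to the active backend."""

    def __init__(self, old, new):
        self.old, self.new = old, new
        self.active = old
        self.log = []  # writes that hit the old system after the batch copy

    def write(self, key, value):
        self.active[key] = value
        if self.active is self.old:
            # Remember the write so the new system can replay it later.
            self.log.append((key, value))

    def read(self, key):
        return self.active[key]


def batch_copy(router):
    """Phase 1: bulk-copy a snapshot of the old data into the new system."""
    router.new.update(router.old)
    router.log.clear()


def catch_up(router):
    """Phase 2: replay writes logged since the snapshot; returns how many."""
    applied = 0
    while router.log:
        key, value = router.log.pop(0)
        router.new[key] = value
        applied += 1
    return applied


def cut_over(router):
    """Phase 3: final sync, verify, then redirect traffic to the new system."""
    catch_up(router)
    if router.new != router.old:
        raise RuntimeError("new system diverged; staying on the old one")
    router.active = router.new


old, new = {"a": 1, "b": 2}, {}
router = Router(old, new)
batch_copy(router)
router.write("c", 3)   # traffic keeps arriving during the migration
router.write("a", 10)
cut_over(router)
router.write("d", 4)   # served by the new system from now on
```

The verification step in `cut_over` stands in for the "all kinds of tests" part; a real system would compare checksums or run shadow traffic instead of comparing whole datasets.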

Note: It can also help if your app was written in something like Common LISP or Erlang that supports live updates. That with the delta approach (version A->A/B->B) equals upgrades with no downtime. ;) Combining it with the clustering approach is quite powerful, but the clustering approach is more applicable to the tools the majority uses.


bbrazil 3917 days ago

Given they had ~300 minutes of outage in 3 years, you're looking at ~99.98% reliable in just that region. That's pretty good for a stateful serving system, and indeed you'd be pushed to do better.
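For anyone wanting to check that figure, here is the arithmetic as a small sketch. The `availability` helper is a made-up name, and it assumes 365-day years with the ~300 minutes counted against wall-clock time in a single region.

```python
def availability(downtime_minutes: float, period_days: float) -> float:
    """Fraction of wall-clock time the service was up over the period."""
    total_minutes = period_days * 24 * 60
    return 1.0 - downtime_minutes / total_minutes


three_years_in_days = 3 * 365  # 1,576,800 minutes in total
uptime = availability(300, three_years_in_days)
print(f"{uptime:.2%}")  # rounds to 99.98%
```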


pfortuny 3917 days ago

That is just time, not traffic loss; it would be hard to lose that much traffic with a single in-house service.

Metrics depend.
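A rough sketch of the difference, with invented numbers: the same outage looks very different depending on whether you count minutes or requests. The function names and the per-minute `(requests, failed)` tuples are assumptions for illustration, not anything taken from the AWS report.

```python
def time_availability(minutes):
    """Share of minutes with no failed requests.

    minutes: list of (requests, failed) tuples, one per minute.
    """
    clean = sum(1 for _requests, failed in minutes if failed == 0)
    return clean / len(minutes)


def request_availability(minutes):
    """Share of requests that succeeded, regardless of when they arrived."""
    total = sum(requests for requests, _failed in minutes)
    failed = sum(failed for _requests, failed in minutes)
    return 1 - failed / total


# Made-up traffic: nine quiet minutes, then one peak minute that fails entirely.
traffic = [(100, 0)] * 9 + [(1100, 1100)]
by_time = time_availability(traffic)         # 9 of 10 minutes were clean
by_requests = request_availability(traffic)  # 1100 of 2000 requests failed
```

With these numbers the time-based figure is 90% while more than half of the requests were lost, which is the sense in which the metric you pick changes the story.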
