by gnur 1683 days ago
> Less than 100% reliability is essential

This is actually a take most SREs would / should believe. Every 9 added to the reliability target increases the price exponentially. Finding the correct level of reliability is something most companies should focus more on, because sometimes a single physical machine that could go down once a year for a few hours is perfectly capable of providing all the resources a medium-sized business could need. Proper backups, monitoring and recovery runbooks can even decrease the downtime of such a simple system to minutes, while easily saving you maybe thousands per month.


I was surprised by the difficulty in getting a company to accept a target of “three nines five” (0.9995) at a time when they were growing rapidly and launching new physical and digital products on a rapid and continuous basis. I prevailed, but what I expected would be a five minute conversation took a couple 45 minute discussions (reducing the work uptime of people in those discussions to 0.9993 for the year... :) )

Slowing your young company down in order to turn 0.9995 to 0.9998 is almost always a terrible trade. Even turning 0.995 to 0.999 is hard to justify in most places. (That improvement saves about 35 hours of downtime per year.)
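
The arithmetic behind that figure can be checked with a short sketch. It assumes a 365-day year (8,760 hours), and the helper name `downtime_hours_per_year` is made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, 8760 hours

def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime in hours per year at the given availability (e.g. 0.999)."""
    return (1.0 - availability) * HOURS_PER_YEAR

# Hours of downtime avoided per year by each of the two improvements mentioned above
saved = downtime_hours_per_year(0.995) - downtime_hours_per_year(0.999)
tighter = downtime_hours_per_year(0.9995) - downtime_hours_per_year(0.9998)

print(f"0.995  -> 0.999  saves about {saved:.1f} hours per year")
print(f"0.9995 -> 0.9998 saves about {tighter:.1f} hours per year")
```

The second improvement buys back only a few hours per year, which is the scale the "terrible trade" remark is weighing against slower product work.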

Is there a rigorous framework to arrive at those targets? How do you know what you built has 0.9995 uptime, and not just 0.99?
By far the easiest way is to measure it after the fact, but I know that’s not what you’re asking... :)

We did do some "analysis", meaning that we made some underlying guesses and multiplied them together, but the real value is in getting people to think that 1.000 is not the actual goal-line, then tracking and doing RCA on all the outages, bucketing them into categories so you know whether to invest more in diverse networking, software testing, HA for DB servers, failover sites, zero downtime releases, etc.
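
The bucketing step might look like the sketch below. The outage records and category names are invented for illustration, not data from this thread; the point is only that summing downtime per root-cause category shows where the budget is actually going.

```python
from collections import defaultdict

# Hypothetical RCA log: (root-cause category, minutes of downtime)
outages = [
    ("release", 25),
    ("network", 40),
    ("database", 12),
    ("release", 35),
    ("network", 8),
]

def downtime_by_category(records):
    """Sum downtime minutes per category, largest bucket first."""
    totals = defaultdict(int)
    for category, minutes in records:
        totals[category] += minutes
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

ranked = downtime_by_category(outages)
for category, minutes in ranked:
    print(f"{category:<10} {minutes:>4} min")
```

With this made-up log, releases are the largest bucket, so zero-downtime releases would be a better first investment than diverse networking.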

Many times, you can avoid entire massive projects (“we need to be hosted in 2 geographically diverse data centers for availability” “uh, no we don’t; we have a budget of 262 minutes of downtime per year and that project will save us less than 60 minutes per year on average, using the best case assumption that our own changes to implement it cause no downtime”)
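
As a rough check of that argument, assuming a 365-day year and the 0.9995 target mentioned earlier in the thread, the budget and the share a project could recover work out as follows. `project_savings_minutes` takes the "less than 60 minutes" estimate at its upper bound.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year

target = 0.9995
budget_minutes = (1 - target) * MINUTES_PER_YEAR  # allowed downtime per year

project_savings_minutes = 60  # upper bound quoted for the two-data-center project
share_of_budget = project_savings_minutes / budget_minutes

print(f"budget: {budget_minutes:.1f} min/year")
print(f"project recovers at most {share_of_budget:.0%} of the budget")
```

Under these assumptions the project recovers less than a quarter of the budget, before counting any downtime caused by implementing it.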

If you're a large corporation, one way to get a good idea is to run lots and lots of fire drills around various disaster scenarios and time how long actual service restoration and re-routing take. For other companies it's just guesswork.
Around 2012-2013 I was working on an online education platform. We had a whole web application that would serve video content, collect student answers and analyze the student's progress in real-time(ish) in order to decide the next action for the student. E.g., if the student started to get questions wrong that they had been getting right before, we'd take it as a sign of fatigue and recommend that they take a break. Or if the student showed that they had mastered a topic, we would jump ahead in the lesson to something else that needed more work.

So we needed a web server, a database, a queue system to run these heuristics and we needed to host/distribute ~100GB worth of content, most of it video.

We were bootstrapping, so I was trying to (1) save as much as possible on operational costs and (2) punt on all the "scaling issues" that would require more of my devops time, which would be better spent developing and adding more features. I deployed the whole system on a single server from Hetzner: Django app, PostgreSQL, Redis for caching and session management, RabbitMQ for Celery. All on one machine with 32GB of RAM and a RAID system with enough capacity to hold the data. I think it was costing us less than 50€/month. That is all we needed to (easily) serve ~800 students and the staff who would author new content.

In the end we delivered everything we promised to our first customer, but we were not able to grow our revenue as much as we expected, so by the end of 2013 we just put the whole company on the backburner, got a small maintenance contract with the main customer and went on to find other jobs.

From end-2013 until 2018, I needed only to make sure that our domains and SSL certificates were up-to-date every six months, upgrade Django packages in case of security issues and deal with ONE incident (in 2016 IIRC) where a disk failure put the array in degraded mode. I solved that by getting a new server at Hetzner (better specs and cheaper, after all those years), warning the customer that the service would be taken offline for a couple of hours later in the day, rsyncing the content, restoring the database and redeploying the application with the Fabric script.

This is one of the projects I am most proud of, given what was accomplished under all the constraints, and it made me realize the difference between a Software Developer and an Engineer. Yet, it translates to a very poor entry on a CV. We are too used to asking in interviews what people have done and what technologies they have used, but we rarely ask about the moments where it was best to avoid doing something.

Especially if you consider bare metal servers. I'm currently paying 45€ for a Ryzen server with 64gb ECC ram and 1tb nvme storage (raid1).

The speed is incredible compared to EC2 or root server performance from other vendors. Even if they have dedicated resources.

The cache misses alone mean the cloud should be cheaper than bare metal. In general you can buy outright any cloud service for about 3 months of the price of the cloud.

Why anyone would run their pointer chasing code in a heavy cache eviction environment is beyond me. The code is slow to start with, and then you make sure that none of your data is in the cache. Why you'd pay 10x for slower hardware makes no sense.

What people should be doing is running on bare metal and turning off all the garbage meltdown protections that kill performance. If you're not a cloud provider and you're allowing people to execute arbitrary code on your hardware, you've got much bigger problems than meltdown.

> In general you can buy outright any cloud service for about 3 months of the price of the cloud.

If you compare on-demand prices for cloud, sure. Reserved and spot instances change the balance significantly. If you're running a handful of servers, sure, it's a no-brainer. But when you start dealing with any sort of human cost (operations, IT), the savings you get are dwarfed by the human costs, because that's what you're paying for with AWS and Azure. And when you're at mega scale, you're negotiating separate deals anyway.

That's also not considering the value of the combined offerings. On AWS, for example, I can spin up a Kubernetes cluster with rolling updates pushed by GitHub Actions in less time than it took me to write this comment, and it will be usable and modifiable by anyone who has experience with AWS or k8s. The cost savings of running my own infrastructure and managing all the moving pieces are dwarfed by the fact that the service provided is widely used and well known.

> I'm currently paying 45€ for a Ryzen server with 64gb ECC ram and 1tb nvme storage (raid1).

That does sound like a really good deal!

Until now i've only been using VPSes (apart from homelab servers as CI nodes etc.) because they're cheaper for the smaller sizes, but for comparison's sake, the cheapest VPS provider's (that i know of and trust) offering with 64 GB of RAM and 640 GB of storage would cost ~260 euros a month: https://www.time4vps.com/?affid=5294

Well, i guess there's also other VPS providers out there that can nearly match the price, like Contabo, though they do have mixed reviews: https://contabo.com/en/ (personally i just found their UI to be extremely dated and there are setup fees, but otherwise they were decent), though even then they'd cost anywhere from 30 - 90 euros a month.

yeah, low resource VPSs are great value if you don't mind the performance too much.

I was using a Netcup root server with 2 dedicated cores/8gb ram before i switched to my current hetzner baremetal server. It only cost ~7€ per month, so much better value if you don't mind that everything just takes a little longer.

i don't think i'll ever go back though. even using the shell on the baremetal server is so much more responsive vs the vps.

but for what it's worth: you can get a VPS with similar resources (16 cores, 64gb ram, 2tb ssd) for 40€ with netcup.

And anything that is static and needs to be up can just be cached at the edge somewhere, which is peanuts really, and means that if your bare metal goes down, you can still keep something up
May I ask where you rent it?
I pay about the same for a server from Hetzner - from the server auctions (https://www.hetzner.com/sb)

- AMD Ryzen 5 3600 6-Core Processor (Cores 12)

- 64GB

- 2 x 2TB HDDs

That's cheap.

We really underestimate the costs of running in the cloud.

It's mostly marketing by AWS employees and professionals who built their careers around AWS pricing and complexity.

A great idea, to be honest: a market willing to overpay for servers will probably be able to pay you more, too.

I guess a lot of the cloud costs are due to not having to really manage anything yourself - you are essentially paying for a team of people to keep the 'server' up and running and make sure that things 'Just Work' (largely, anyway)
No amount of money makes a system 100% reliable.

On small platforms we are still stuck in the 1990s approach of having one reliable system.

We need distributed[1] systems and protocols even in small applications. Easy to use and self-healing.

[1] No, I'm not talking about blockchains

My former employer used to target 99% uptime for non-essential systems. It made a ton of sense: the cost of downtime was often incredibly low, while the cost and complexity of making it 4 9s was really high.
There's a huge jump in cost and operational style, to go from two nines to three, because it means you have to have 24/7 support coverage or an on call rotation (and good alerting, or else it's for naught) for nights. Two nines just means you need someone to check their messages sometimes, during the day, on weekends. One nine, and you can forget about the weekends, too—and that's actually sorta OK for certain applications.

Three nines also means you can't afford to intentionally take a system down to work on it, or you'll burn all your "oopsie" downtime. That means a ton more work in infrastructure and deployment processes, than two nines.
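
To put rough numbers on that, here is a sketch of the yearly allowance at one, two, and three nines, assuming a 365-day year. The one-hour monthly maintenance window is a hypothetical example, not something from the comment.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year

def yearly_budget_hours(nines: int) -> float:
    """Allowed downtime per year for an availability of `nines` nines (2 -> 99%)."""
    return HOURS_PER_YEAR * 10 ** (-nines)

for n in (1, 2, 3):
    print(f"{n} nine(s): {yearly_budget_hours(n):7.2f} h/year")

# Hypothetical planned maintenance: one hour, once a month
planned_hours = 12 * 1.0
print("fits in three nines:", planned_hours <= yearly_budget_hours(3))
print("fits in two nines:  ", planned_hours <= yearly_budget_hours(2))
```

In this example the planned windows alone exceed the three-nines allowance, leaving nothing for unplanned outages, while at two nines they use only a small part of it.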

If you’re not a global enterprise, you just don’t respond off hours.
The Google SRE book, which I think is a reasonable reflection of SRE culture generally, actually mentions this in the very first chapter.