by momokoko 2146 days ago
You’d be shocked how rare downtime is with modern hardware. A redundant power supply and SSDs in the right RAID configuration typically will not have any issues for years until it can be replaced by a newer model. Also, hardware monitoring is significantly improved to the point where you’ll typically know if something will fail and can schedule the maintenance.

In the past power supplies and spinning disc hard drives would fail much more often.

It’s basically a solved problem, outside of extremely mission critical, 5 nines kind of stuff, that we all forgot because of AWS.

HN ran, and may still run, on a single bare metal server.
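For a sense of scale on the "5 nines" mentioned above, here is a rough sketch of the yearly downtime budget each availability level allows. The function name and the 365.25-day year are assumptions made for illustration, not anything from a real SLA.

```python
# Yearly downtime budget for "N nines" of availability.
# Assumption: a year is 365.25 days; real SLAs define their own measurement window.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def downtime_budget_seconds(nines: int) -> float:
    """Allowed downtime per year, in seconds, at `nines` nines of availability."""
    unavailability = 10 ** (-nines)  # e.g. 5 nines -> 0.00001
    return SECONDS_PER_YEAR * unavailability


for n in range(2, 6):
    minutes = downtime_budget_seconds(n) / 60
    print(f"{n} nines: {minutes:.1f} minutes of downtime per year")
```

By this arithmetic five nines leaves a little over five minutes a year, and even three nines leaves under nine hours, so a single 10-hour outage would blow through a three-nines budget on its own.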

11 comments

> HN ran, and may still run, on a single bare metal server.

I bet HN wouldn't do a 10-hour high-risk operation for moving their servers because they can't afford an outage. (But well, running stuff on a single bare-metal server is expensive enough that even if they could, I expect they don't.)

What would that company do if a pipe broke inside the datacenter? Besides, if you never restart your servers, you are guaranteeing that the one time when the power goes off on the entire city, they won't come back online.

> I bet HN wouldn't do a 10 hours high-risk operation for moving their servers because they can't afford an outage.

HN is probably not business-critical and could probably afford a 10-hour downtime without much hassle.

The point is that they probably also wouldn't then insist on a consultant doing an unreasonable migration and threatening to not pay them if there was downtime. And they probably wouldn't call around to other consultants with the same requirements, apparently telling them that the first consultant refused to do the job.
> apparently telling them that the first consultant refused to do the job.

While I don’t think they informed them of this in good faith, it is a nice heads-up. In this case, it meant Consultant2 could consult RefusingConsultant, who probably knew the IT better.

It would be legitimately interesting if a 10 hour downtime of HN was at all correlated to an increase in github commits.

I hope there wouldn't be a correlation, but I wouldn't be all that surprised if a somewhat loose one was found.

Quality hardware has existed for years. At a Ford Motor plant they were doing an inventory and couldn't locate a 10 ton mainframe. It had been working so well for 15 or so years that the tribal knowledge of where it was physically located was lost.
Wow, that's impressive losing that big a piece of hardware.

Though it was likely easier to find than that Novell Netware server that was sealed behind some drywall, with only a stray network cable leaving any clue as to where it was.

Depends on how big the building is that houses it – manufacturing IT can deal with impressive floor spaces.

I once only half jokingly suggested finding a missing data closet in a two million square foot distribution center by pinging a known IP from three or four aggregator switches across the building and triangulating the location on a floor plan. Sadly the people crawling around the ceiling found it before I could put my idea into practice.

2Msqft is c.430m x 430m for a square floorplan. Ping resolution is 1us (microsecond). Speed of electrical signal in copper is about 0.8c. Gives a max resolution of ~240m by my reckoning. If there are variances in the switch+network delay it seems like you're going to struggle to even say which side of the building it is.

Good job they found it!
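The back-of-the-envelope numbers above can be checked with a short sketch. It uses the comment's own figures (2 million sq ft, 1 µs timer resolution, 0.8c in copper); the function names and the `round_trip` switch are assumptions added here. Since ping reports round-trip time, the one-way distance per timer tick could arguably be half the quoted value.

```python
import math

SQFT_TO_SQM = 0.09290304        # exact: (0.3048 m) ** 2
SPEED_OF_LIGHT = 299_792_458    # m/s


def square_side_m(area_sqft: float) -> float:
    """Side length in metres of a square floor plan with the given area."""
    return math.sqrt(area_sqft * SQFT_TO_SQM)


def distance_resolution_m(timer_resolution_s: float,
                          velocity_factor: float = 0.8,
                          round_trip: bool = False) -> float:
    """Distance a signal covers in one timer tick.

    velocity_factor=0.8 is the thread's estimate for copper.
    round_trip=True halves the result, because an RTT covers the path twice.
    """
    distance = timer_resolution_s * velocity_factor * SPEED_OF_LIGHT
    return distance / 2 if round_trip else distance


side = square_side_m(2_000_000)
one_way = distance_resolution_m(1e-6)
halved = distance_resolution_m(1e-6, round_trip=True)
print(f"building side: {side:.0f} m")
print(f"resolution per 1 us tick: {one_way:.0f} m ({halved:.0f} m if halved for round trip)")
```

Either way the resolution is a large fraction of the building's width, before any switch or queueing jitter is considered, which supports the "north side or south side" conclusion.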

Hah! Good math. Based on the switch placement and the building being more of a rectangle I figured "north side or south side" would be as close as I could get. And when we really dug in it was a classic last mile problem: the first several core switches were well known, we just needed to figure out where the last aggregate switch went.

Turns out a door was closed off and a new one built onto a hallway leading to another hallway, and it was not properly labeled on the updated drawings. Had one of the boxes running a conveyor belt not died, we'd never have looked.

This is all true, but you still can't rely on increased hardware quality if you can't afford any downtime due to moving (a one-time event) a server.

Also, that doesn't cover other problems mentioned here, like natural disasters, ISP problems, etc.

Often these kinds of SLAs are decided upon based on blame rather than what is reasonably required by the customers of that system. In this case, moving offices means the downtime is due to internal reasons. But if an ISP goes down or there is a natural disaster, then that isn't in their control.

Also, cost does come into play as well. Multiple physical links in would be very expensive for what sounds like internal services. Likewise a natural disaster might cause bigger issues to the company than those internal services going down. They might still have offsite backups (I'd hope they would!) so at least they can recover the services, but the cost of having a live redundancy system off site might not be justified by those risk factors.

The customer's requirements are definitely unreasonable though. I'd hope those systems are regularly patched, in which case when is downtime for that scheduled, and why is that acceptable but not when you're physically moving the server? It doesn't really make much sense; but then "not making much sense" is also quite a common problem when providing IT services for others.

You are right, their SLA can be a bit different from what we're talking about here (and expect).

In general, we don't know much about this case. It's a post on Reddit, might not even be true. As is, it doesn't make much sense, but we don't know all the details, so maybe we jumped to conclusions.

> can't rely on increased hardware quality if you can't afford any downtime due to moving (a one-time event) a server.

Mainframe is not just a server. You can hot plug RAM on these things.

Still, sooner or later, the data center will be hit by a natural disaster, a DoS attack, a network problem, or the like, and you'll have to be ready to move to a different one to get your service back online. Or you'll have to reboot your server to apply a critical kernel security update, in which case you need to be ready to fail over to a hot standby. So, since relying on a single server with high-uptime hardware is penny-wise and pound-foolish, you might as well go with a cloud-style architecture with commodity hardware.
I used to be fascinated with datacenters and would masquerade as a customer prospect to get a tour and see all the cool gear. I was asking one engineer about what their plan was for a tornado (this was at ThePlanet in Dallas TX way back when) and they basically scoffed at the question. A week or so later one briefly touched down about 1/4 mile from them. I wonder if they thought about me when the sirens were going off, hah.
Even in modern hardware there are plenty of single points of failure.

Single server and "can't tolerate any downtime" are mutually exclusive.

AWS and older hardware are no different. Set it up once and it keeps running for many years.

I've come across an old AWS account (the startup has been using AWS for the longest time). All the network traffic and VPN go through a single instance with 3 years of uptime.

AWS EC2 instances or their host machines can fail at any time and it’s out of your hands.
True fact! I recently had EC2 migrate my VM when the physical server it was on reached EOL. If they had fired my VM up again, I wouldn't have even noticed. They didn't. Fortunately it had an EBS volume and I was able to manually restart it without data loss.
Physical servers can fail at any time and it's out of your hands. ;)
Human error is a bigger cause of downtime than technical failure or natural disasters. And in practice, a single server like this tends to be a hand managed one-off which only exasperates the human error component.
s/exasperates/exacerbates/
It's probably a bit of both, TBH. ;)
Unfortunately complacency about how reliable modern hardware is can lead to neglecting things like off site backups. And other issues. Yeah your one big critical on premises server may be super reliable. But what happens when the building is flooded with 6 ft of water, catches on fire, is leveled in an earthquake, or anything else?

If a function is super critical to business, it also deserves to have some thought put into the blast radius of its failure.

The sort of places that would insist on rolling a live server 700 ft across a parking lot probably don't have any real disaster recovery plan.

>hardware monitoring is significantly improved to the point where you’ll typically know if something will fail and can schedule the maintenance.

There's SMART for disks... what else?

And multiple power supplies. I have been running a single physical server like this for ~10 years and the only downtimes were me restarting to boot a new kernel and when people at the datacenter messed up BGP routing (their fault). HW is really very reliable now, especially in a datacenter environment. But still not 100% of course. There is still a low probability of it failing, though lower than most think. IC chips most likely won't break; only some capacitors degrade over time, and the flash memory holding the BIOS is normally guaranteed for only 10 years. A BIOS upgrade (a new write) would prolong that, though. I had one disk fail in RAID. Changed the drive without any downtime.
ECC for RAM is the other big one. A single-bit error will trigger warnings, so that you can replace the faulty DIMM before it progresses into uncorrectable errors.
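As a rough illustration of watching for those warnings on Linux, the kernel's EDAC subsystem exposes per-memory-controller corrected and uncorrected error counters under sysfs. The sketch below assumes a machine with an EDAC driver loaded; the function names and the warning wording are made up for this example.

```python
from pathlib import Path

# Linux EDAC sysfs location; only present when an EDAC driver is loaded.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_ecc_counts(root=EDAC_ROOT):
    """Return {controller_name: (corrected, uncorrected)} from EDAC counters."""
    root = Path(root)
    counts = {}
    if not root.is_dir():
        return counts  # no EDAC support on this machine
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            corrected = int((mc / "ce_count").read_text().strip())
            uncorrected = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters are missing or unreadable
        counts[mc.name] = (corrected, uncorrected)
    return counts


def ecc_warnings(counts):
    """Turn non-zero counters into human-readable warnings."""
    warnings = []
    for name, (corrected, uncorrected) in counts.items():
        if uncorrected:
            warnings.append(f"{name}: {uncorrected} uncorrectable errors, replace DIMM now")
        elif corrected:
            warnings.append(f"{name}: {corrected} corrected errors, schedule a DIMM swap")
    return warnings


if __name__ == "__main__":
    for line in ecc_warnings(read_ecc_counts()):
        print(line)
```

Run from cron or a monitoring agent, something like this gives the early warning described above; mapping a controller back to a physical DIMM slot is hardware-specific and left out here.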
Is there a tool that can randomly take 128 MB chunks of memory out of the pool and test them around the clock?
>HN ran, and may still run, on a single bare metal server.

HN also has downtime fairly often.

Yeah, that's how you end up with 3 years of uptime on some forgotten servers... :)
Which is why AWS instances should be no more than minions in a load balancer pool, and any permanent state on an EBS volume or a managed storage service.
What's the current advice on SSD RAIDs?