Hacker News
by tferris 4864 days ago
Saturday, I thought: "This guy (PG) is really brave, everybody nowadays is using virtualized server environments on multi-core machines and PG just gets some bare metal and puts his app on one core and one thread. WTF, this guy is a genius, single-threaded apps on bare metal FTW, this is so great and I am doing this too."

After seeing this thing down for several hours -- and Google hates downtime, they punish you instantly -- I think: "Maybe not."

However, I'd be happy to hear what went wrong and why we should still go for the bare metal thing.

6 comments

Virtualized environments vs. bare metal has never really been about performance. It's about whether you want to spend the time required to tend your servers, or whether you want to treat them as disposable.
What would the benefits of using a virtualized server for a relatively small (in terms of server needs) site like HN be? The way I see it so far, virtualization is mostly beneficial for hosting companies, which can provide cheaper hosting and get better isolation and easier resource allocation. But I can't think of any big benefits for the site owner, assuming they can afford a dedicated server, and I can see downsides of virtualization, so I am genuinely curious if I'm missing something.
One thing is abstracted failure-proof(ish) disks. I'm considering moving from Linode to Hetzner for the enormously larger amount of RAM I can afford on a bare metal box there, but the one thing that gives me pause is having only RAID1 redundancy, and having to manage it myself.
(Live) Migration is one nice feature of virtualization.
the site actually runs very quickly, so can't argue it needs more cores / threads.

seems the issue was more with the migration.

so i reckon you're good to get back to removing all the threads from your apps :)

I totally agree, this new machine is so fricking responsive that I still think bare metal rules, but I am just so afraid of the system operations of such a machine. But maybe Paul can give us some hints about why it's still worth going the bare metal route.

EDIT: since Heroku fooled Rapgenius and us all, that would be one more reason to get into system operations again and host on bare metal.

Many people use bare metal for their base load and virtual/cloud instances for their fail-over and elastic loads. It's not all or nothing.
I wanted to reply that it's actually relatively fast, not very fast. Much faster than before, but there are many sites which are much quicker.

Timing some pages, the response is around 350ms. That means 95% of the loading time is networking, not generation time. You're right, the server is really fast given the load HN brings down on one server!

Virtual servers vs dedicated servers doesn't have much to do with reliability. Both go down. What you want in order to minimize the risk of downtime is redundancy, e.g. hot fail-over.
what's the best way to have hot fail over?
That very much depends on the scale you're at and the database that you use. If your scale is such that you could comfortably run on 1 machine (as it is for 99% of sites), it's probably sufficient to rely on your database to handle the replication. For example for postgres: http://www.postgresql.org/docs/devel/static/high-availabilit...

Then to utilize this you want a load balancer in front that's unlikely to fail and if it fails can be restarted rapidly. This load balancer should send requests to the other server as soon as one has failed. The other option is DNS failover, which can work but has different trade-offs.
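To make the load balancer idea concrete, here is a minimal sketch of "send requests to the other server as soon as one has failed". Everything in it is an assumption for illustration: the backends are plain Python functions standing in for real servers, `dispatch` and `AllBackendsFailed` are hypothetical names, and a real load balancer would also do health checks and timeouts.

```python
from typing import Callable, Sequence


class AllBackendsFailed(Exception):
    """Raised when no backend could answer the request."""


def dispatch(request: str, backends: Sequence[Callable[[str], str]]) -> str:
    # Try the backends in order and fall over to the next one
    # as soon as the current one fails to respond.
    for backend in backends:
        try:
            return backend(request)
        except ConnectionError:
            continue
    raise AllBackendsFailed(request)


def primary(request: str) -> str:
    # Hypothetical primary that is currently down.
    raise ConnectionError("primary is down")


def standby(request: str) -> str:
    # Hypothetical hot standby that is healthy.
    return f"standby handled {request}"


print(dispatch("GET /news", [primary, standby]))
```

The caller never sees the primary's failure, which is the point of hot fail-over. It is also exactly why the next problem exists.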

One thing that this does not protect you against is repeatable failure due to a software bug. If a request comes in that happens to crash one of your servers, load will be quickly switched to the other. But the user that caused the original server to crash might refresh his page because he didn't get any response back, which also sends the problematic request to the other server, and might crash that one too. These kinds of problems are very hard to deal with (if the requests don't come from humans but from programs that automatically retry their request to other servers when they don't get a response it's even worse: a single bad request can bring an entire cluster down in seconds).
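A toy simulation of that cascade, as a rough sketch only: the `Server` class, the `"POISON"` request, and `send_with_retry` are all made up for illustration, and "crashing" is modeled by just flipping a flag.

```python
class Server:
    def __init__(self, name: str) -> None:
        self.name = name
        self.up = True

    def handle(self, request: str) -> str:
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        if request == "POISON":
            # Stand-in for a request that triggers a crash bug.
            self.up = False
            raise ConnectionError(f"{self.name} crashed")
        return f"{self.name} ok"


def send_with_retry(request: str, servers: list) -> "str | None":
    # Naive client: retry the same request on the next server
    # whenever the current one gives no response.
    for server in servers:
        try:
            return server.handle(request)
        except ConnectionError:
            continue
    return None


cluster = [Server("a"), Server("b"), Server("c")]
send_with_retry("POISON", cluster)
alive = [s.name for s in cluster if s.up]
print(alive)  # prints [] because the one bad request took out every server
```

Redundancy did nothing here, because every replica runs the same buggy code and the retry delivers the same request to each of them.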

Paradoxically, measures to improve availability can sometimes cause availability to go down. If you make your architecture more complicated you introduce more opportunities for failure, especially due to bugs. If you have multiple servers in a pipeline, for example a load balancer and then the real servers which in turn talk to the database servers, that can also increase your likelihood of failing due to hardware. If your pipeline depth is three, and each stage is a single machine, then the chance that you have a hardware failure is about three times as big as with a pipeline depth of one. You want to minimize the depth and maximize the width of the pipeline.
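The "about three times" figure can be checked with a few lines of arithmetic. This sketch assumes every machine fails independently with the same small probability `p` over some period; that assumption and the value 0.01 are illustrative, not measurements.

```python
def pipeline_failure_probability(p: float, depth: int) -> float:
    # A pipeline of single machines is down if ANY stage is down.
    return 1 - (1 - p) ** depth


def wide_stage_failure_probability(p: float, width: int) -> float:
    # A stage with redundant machines is down only if ALL of them are down.
    return p ** width


p = 0.01  # hypothetical per-machine failure probability
one_deep = pipeline_failure_probability(p, 1)
three_deep = pipeline_failure_probability(p, 3)
print(f"depth 1: {one_deep:.4f}")
print(f"depth 3: {three_deep:.4f} ({three_deep / one_deep:.2f}x)")
print(f"one stage, width 2: {wide_stage_failure_probability(p, 2):.4f}")
```

For small `p`, `1 - (1 - p)**3` is close to `3 * p`, which is where the factor of three comes from. Width works the other way: each extra replica in a stage multiplies that stage's failure probability by `p`.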

So you should ask yourself whether it's worth it, especially when you're small. Many of the successful sites had or still have a lot of downtime. Maybe it's better to minimize the duration of downtime with fast restarts rather than trying to lower the probability of downtime.
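One way to put numbers on that trade-off is the usual steady-state availability estimate, uptime divided by uptime plus repair time. The figures below (one failure a month, four hours to recover) are invented purely for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    # Fraction of time the site is up, given the mean time between
    # failures (MTBF) and the mean time to recover (MTTR).
    return mtbf_hours / (mtbf_hours + mttr_hours)


baseline = availability(mtbf_hours=720, mttr_hours=4)          # fails monthly, slow recovery
fewer_failures = availability(mtbf_hours=1440, mttr_hours=4)   # half as many failures
fast_restart = availability(mtbf_hours=720, mttr_hours=0.1)    # same failures, 6 minute restart

print(f"baseline:       {baseline:.5f}")
print(f"fewer failures: {fewer_failures:.5f}")
print(f"fast restart:   {fast_restart:.5f}")
```

With these made-up numbers, making restarts fast beats halving the failure rate, and it usually doesn't require a more complicated architecture.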

Oh, that explains why this site is really slow occasionally.
how did you know everyone uses virtualized server environments on multi-core machines?