by smerritt 5293 days ago
We had something similar but smaller (~8 TB) at a place I worked, and it was a nightmare. Migrating from that to S3 was one of the best things to happen to that project.

Being a single big box, it had a bunch of single points of failure, and boy did they fail; we probably had 5-10 hours per month of downtime due to the photo server falling over (flaky RAID controller firmware, mostly).

Also, since the big box was expensive, we only had one, in production. The code that took a newly uploaded photo and copied it over to the photo server only executed in production, which meant the only way to functionally test it was to ship it and hope.

We switched to S3 about a year ago with different buckets for prod, staging, and dev; the production-only code paths went away, and there hasn't been any photo-related downtime since. Definitely worth it.
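A minimal sketch of what one bucket per environment buys you: the same upload code runs in dev, staging, and prod, and only the bucket name changes. Everything named here is hypothetical (the bucket names, `PhotoStore`, `InMemoryStore`, `save_photo`); the in-memory store is a stand-in for a real S3 client so the example runs without credentials. It is not the code from the project described above.

```python
from typing import Dict, Protocol, Tuple

# Hypothetical bucket names, one per environment.
BUCKETS: Dict[str, str] = {
    "prod": "photos-prod",
    "staging": "photos-staging",
    "dev": "photos-dev",
}


class PhotoStore(Protocol):
    """Anything that can store bytes under (bucket, key)."""

    def put(self, bucket: str, key: str, data: bytes) -> None: ...


class InMemoryStore:
    """Stand-in for an object-store client (assumption, for illustration)."""

    def __init__(self) -> None:
        self.objects: Dict[Tuple[str, str], bytes] = {}

    def put(self, bucket: str, key: str, data: bytes) -> None:
        self.objects[(bucket, key)] = data


def bucket_for(env: str) -> str:
    """Map an environment name to its bucket; reject unknown environments."""
    try:
        return BUCKETS[env]
    except KeyError:
        raise ValueError(f"unknown environment: {env!r}") from None


def save_photo(store: PhotoStore, env: str, key: str, data: bytes) -> str:
    """Store a photo. The code path is identical in every environment."""
    bucket = bucket_for(env)
    store.put(bucket, key, data)
    return bucket
```

Because nothing in `save_photo` branches on the environment, a test run against the dev bucket exercises the same path that production uses.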

2 comments

This sounds like you had a really bad implementation. A proper file server of this small size would not fail for several hours per month.
Absolutely true. The RAID controller would randomly lose drives, and its driver would randomly cause kernel panics. We tried different firmware versions and different kernels and made some progress, but never really got it stable under load.

However, that's the risk you run with single points of failure. Put all your data on one big box, and any failure in your RAID hardware, RAID firmware, RAID drivers, network drivers, kernel, RAM, OS, et cetera will take down the big box and thus take down anything relying on it.

The lesson I learned wasn't to make a super-robust single system, it was to have enough redundancy to stay up when something inevitably fails.
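Some back-of-the-envelope arithmetic for the 5-10 hours per month mentioned upthread. This is only a rough sketch: it assumes a 30-day month and, for the redundant case, that replicas fail independently. Neither assumption comes from the thread, and real failures are often correlated.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: 30-day month, 720 hours


def availability(downtime_hours: float, total_hours: float = HOURS_PER_MONTH) -> float:
    """Fraction of the time the system is up."""
    return 1.0 - downtime_hours / total_hours


def redundant_availability(single: float, replicas: int) -> float:
    """Availability when any one of `replicas` copies is enough to serve.

    Assumes failures are independent, which is optimistic in practice.
    """
    return 1.0 - (1.0 - single) ** replicas


if __name__ == "__main__":
    for hours in (5, 10):
        one = availability(hours)
        two = redundant_availability(one, 2)
        print(f"{hours} h/month down: {one:.2%} single box, {two:.4%} with 2 replicas")
```

Under those assumptions, even two flaky boxes do far better than one, which is the point about redundancy over a single super-robust system.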

I agree; that just should not happen unless the server was a complete lemon or badly assembled.
> flaky RAID controller firmware

Fortunately, this can usually be rectified with a simple application of money. Good hardware is its own reward.

However, it sounds like the problem that S3 cured was caused by a bad architecture.