by whakojacko 5446 days ago
While you are here... why only a single boot drive? It seems like an obvious point of failure, and at only $40 per drive I would think software RAID 1 would be a no-brainer for reliability.

Regardless, though, I'm super impressed by the work that went into rolling your own hardware. I hope you guys continue to do well.
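
To put rough numbers on the question above, here is a back-of-the-envelope sketch. The $40 drive price comes from the comment and the 200-pod fleet size from the reply below; the 5% annual failure rate is purely an illustrative assumption, not a figure anyone in this thread reported, and the function name is made up for the example. The mirrored case assumes the two drives fail independently and that a dead drive is not replaced during the year, so it likely overstates how often a mirrored pod would lose both.

```python
def boot_drive_tradeoff(pods=200, drive_cost=40.0, annual_failure_rate=0.05):
    """Compare one boot drive per pod against a mirrored (RAID 1) pair.

    annual_failure_rate is an assumed, illustrative value.
    Returns (extra_cost, single_drive_outages, mirrored_outages), where the
    outage figures are expected counts per year across the fleet.
    """
    # One additional drive per pod is the whole cost of mirroring.
    extra_cost = pods * drive_cost

    # With a single drive, every drive failure takes the pod's boot volume down.
    single_drive_outages = pods * annual_failure_rate

    # With a mirror, the boot volume is lost only if both drives fail
    # (independence assumed, no replacement within the year).
    mirrored_outages = pods * annual_failure_rate ** 2

    return extra_cost, single_drive_outages, mirrored_outages


if __name__ == "__main__":
    cost, single, mirrored = boot_drive_tradeoff()
    print(f"extra hardware cost: ${cost:,.0f}")
    print(f"expected boot outages per year, single drive: {single:.2f}")
    print(f"expected boot outages per year, mirrored:     {mirrored:.2f}")
```

Plugging in a different failure rate changes the outage counts but not the shape of the tradeoff: the extra cost grows linearly with fleet size, while the mirrored outage count shrinks with the square of the failure rate.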

2 comments

I just asked, and found out we actually HAVE had a number of boot drives fail in our fleet of 200 pods. Most decisions in the pod are about saving money, so our initial thinking was simply that no customer data is on the boot drive, so it isn't all that important. But don't get me wrong, there are SO MANY GOOD opportunities to improve the pod. Backblaze just stops working on the pod once it does what we need, and we run off to focus on other things. Your call on the boot drive is every bit as valid as ours. :-) I'm staring at an open pod here and I see plenty of good spots to put a second boot drive, and we'll probably be moving to a 2.5" form factor (laptop) boot drive sometime soon, which would free up even more space.
(Just FYI, I'm sure you guys have already thought of or experimented with this...)

We've had good luck so far with using small USB flash drives for booting big file servers. We keep the drive image pretty generic and if there's a problem with one, we just replace it with a cloned USB flash drive and reboot, no problem.

It doesn't seem to hurt performance at all for these kinds of uses -- although we do set it up without swap to keep the life of the USB flash device reasonable, which might or might not work in your case.

Personally, I'd eliminate the boot drive and PXE boot.
We're considering a PXE boot solution (among other solutions) just to keep all 200 pods (and growing!) updated to the latest Debian. But we also use the boot drive for a few other things, like error logging. Still, eliminating the boot drive entirely is not a bad idea: we could drop the logs in a folder at the top of the data drives. We already (selectively) mirror various excerpts of the logs to other machines, so that if a whole pod disappears from the network we have some history and an understanding of what was going on when it went missing.
Perhaps streaming logs would help you out? My employer constantly dumps a preposterous amount of log data via streaming log systems (we use log4j, but there are solutions for syslog, etc.). Aside from a few early hiccups, it works pretty well.
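
For anyone curious what streaming logs off the box might look like, here is a minimal sketch using Python's standard `logging` module as a stand-in (not log4j, and not anything Backblaze has said they run). The function name, the pod name, the collector host, and the choice of WARNING as the cutoff for what gets mirrored are all hypothetical, picked only for illustration.

```python
import logging
import logging.handlers


def build_pod_logger(pod_name, collector_host, collector_port=514,
                     mirror_level=logging.WARNING):
    """Return a logger that streams selected records to a remote syslog collector.

    pod_name, collector_host and mirror_level are hypothetical parameters
    for this sketch; they do not describe any real deployment.
    """
    logger = logging.getLogger(f"pod.{pod_name}")
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep records out of the root logger's handlers

    # SysLogHandler sends each record as a UDP datagram by default, so a
    # record can leave the machine without touching a local boot drive.
    remote = logging.handlers.SysLogHandler(address=(collector_host, collector_port))
    remote.setLevel(mirror_level)  # mirror only the more important excerpts
    remote.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))
    logger.addHandler(remote)
    return logger


if __name__ == "__main__":
    # "logs.example.internal" is a placeholder hostname.
    log = build_pod_logger("pod-007", "logs.example.internal")
    log.info("routine heartbeat")           # below the cutoff, stays local
    log.warning("boot drive SMART error")   # streamed to the collector
```

UDP is fire-and-forget, so a collector outage silently drops records; a setup that cares about completeness would probably want TCP or a local spool as well. Calling `build_pod_logger` twice for the same pod name would attach a second handler, which a real version would need to guard against.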