by dmourati 4603 days ago
Ugh. Backblaze is one of those companies with an extraordinarily poor design that they flaunt and "open source" as if anyone would follow their lead. Take a look at the physical design of their system and combine that with the published data. Consider that to remove any hard drive from their setup requires physically removing a 4u rackmount storage pod from the rack. http://blog.backblaze.com/2011/07/20/petabytes-on-a-budget-v...

Also, no hardware raid, battery, or cap.

Source: worked at Eye-Fi, built 2PB storage


Disclaimer: I work at Backblaze but I'm on the software side, I barely ever touch the storage pods anymore.

It is not true that the pod team must remove the 4U server from the rack. It is slid out like a drawer (no tools required, takes maybe 10 seconds). The drive or motherboard is then replaced, then you slide the drawer back in. So the 4U server must slide 18 inches one way, but zero cables have to be unplugged or replugged when done. This only takes one technician and no "server lift", the drawer supports all the weight.

I'm not defending this design, just correcting a mistake. Backblaze frankly "makes do" with this design because nobody will step up and make anything that fits our needs better. The number one criterion is total system cost over the lifetime of the system INCLUDING all the time spent on salaries of datacenter techs dealing with the pods. "Raw I/O performance" is not that important for backup, so trying to sell us an awesome EMC or NetApp that costs 10x as much and has 10x the raw I/O performance is not very compelling to us. But if you came up with a design that let our datacenter technicians replace a drive faster while not significantly increasing overall costs in another area, we SURELY would listen.
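
To make that criterion concrete, here is a rough sketch of a lifetime-cost calculation that counts technician time alongside hardware and power. The class name, field names, and every number fed into it are hypothetical placeholders for illustration, not Backblaze figures.

```python
from dataclasses import dataclass


@dataclass
class PodCostModel:
    """Hypothetical lifetime-cost model for one storage chassis."""
    hardware_cost: float         # purchase price, USD
    power_watts: float           # average draw while in service
    price_per_kwh: float         # electricity price, USD per kWh
    drive_swaps_per_year: float  # expected drive replacements per year
    minutes_per_swap: float      # technician time per replacement
    tech_hourly_rate: float      # loaded technician cost, USD per hour
    lifetime_years: float        # years the chassis stays in service

    def power_cost(self) -> float:
        # Energy (kWh) over the lifetime times the price per kWh.
        hours = self.lifetime_years * 365 * 24
        return self.power_watts / 1000 * hours * self.price_per_kwh

    def labor_cost(self) -> float:
        # Total swaps over the lifetime times the cost of each swap.
        swaps = self.drive_swaps_per_year * self.lifetime_years
        return swaps * self.minutes_per_swap / 60 * self.tech_hourly_rate

    def total(self) -> float:
        return self.hardware_cost + self.power_cost() + self.labor_cost()


if __name__ == "__main__":
    # Made-up inputs, only to show how the pieces add up.
    pod = PodCostModel(
        hardware_cost=1000, power_watts=1000, price_per_kwh=0.10,
        drive_swaps_per_year=6, minutes_per_swap=10,
        tech_hourly_rate=60, lifetime_years=1,
    )
    print(f"power={pod.power_cost():.2f} labor={pod.labor_cost():.2f} "
          f"total={pod.total():.2f}")
```

Comparing two designs then comes down to building one model per design and checking whether a lower minutes_per_swap outweighs any increase in hardware_cost.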

Thanks for the clarification. That the PODs were on rails was never made clear to me. Still, I count that as "physically removing a 4u rackmount storage pod." Those suckers cannot be light. 10 seconds sounds rather fast. I don't imagine you could do it that fast for any of the upper pods.

While I don't recommend them outright, we settled on 3U boxes from SuperMicro. http://www.supermicro.com/products/chassis/3u/837/sc837e26-r...

We somewhat affectionately dubbed them "mullets" as in business in the front, party in the rear.

They make 4U devices as well. Cost was about $1000. We added LSI MegaRAID 9280 controllers, about another $1500, and ran mini-SAS back to a controller node responsible for 4 JBODs.

It's a different trade-off. The Supermicro boxes use drive trays, so swapping a hard drive requires a datacenter tech to handle the tray mounting and unmounting. The PODs just drop drives right in. They've traded off tray mounting work for chassis sliding.
Yev at Backblaze | One of our designs was for an aluminum pod..it made it..."lighter". :)
HW Raid is a PITA for the following reasons:

1. you have to muck around with more firmware and sometimes reboot in order for changes to take effect

2. if a controller dies, you have to replace it with (almost) the exact same controller in order to read the data

3. datacenters rarely lose power; take the HW raid money and instead put servers on true A+B power feeds.

CPUs are so fast these days that they can easily handle in software all the "stuff" that HW raid used to do.

They make a different tradeoff here. There is no need for hardware RAID if performance is not your main concern (and even if it is, hardware RAID is no panacea). If they save everything to disk before acknowledging it, they don't need a battery. I'm not sure what you mean by cap.

Their hardware design is specifically geared towards their use-case and I applaud them for knowing how to optimize for their use-case. I wouldn't use it for mine but only because it's not a good fit.

They can open-source the hardware because the real secret sauce is the software and the hardware open sourcing gives them a nice edge in marketing.

cap=capacitor. http://www.lsi.com/downloads/Public/MegaRAID%20SAS/MegaRAID%...

Edited to add: They've optimized for hardware purchase price and given up reliability (HW RAID, battery, cap), performance, and maintainability. The strange thing is the overall cost of the storage system is driven by power, not purchase price. Smarter RAID controllers, like I link above, let you manage power by spinning down disks as they are unused and thereby reducing your power draw. Can't do that with SW RAID that I've ever seen. Take a look at Amazon Glacier which I suspect is using this power-off strategy to drastically reduce their costs.
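
A back-of-the-envelope sketch of the spin-down argument, for anyone who wants to plug in their own numbers. The function name and all parameters are hypothetical, and the wattages in the example are placeholders rather than measurements from any real drive or controller.

```python
def drive_energy_cost(active_watts: float,
                      standby_watts: float,
                      spun_down_fraction: float,
                      price_per_kwh: float,
                      years: float) -> float:
    """Electricity cost of one drive that is spun down part of the time.

    spun_down_fraction is the share of wall-clock time the drive spends
    in standby, from 0.0 (never spun down) to 1.0 (always spun down).
    """
    if not 0.0 <= spun_down_fraction <= 1.0:
        raise ValueError("spun_down_fraction must be between 0 and 1")
    hours = years * 365 * 24
    # Time-weighted average draw across the active and standby states.
    avg_watts = (active_watts * (1.0 - spun_down_fraction)
                 + standby_watts * spun_down_fraction)
    return avg_watts / 1000 * hours * price_per_kwh


if __name__ == "__main__":
    # Placeholder numbers: compare always-on with half the time in standby.
    always_on = drive_energy_cost(10, 0, 0.0, 0.10, 1)
    half_idle = drive_energy_cost(10, 0, 0.5, 0.10, 1)
    print(f"always on: {always_on:.2f}  half spun down: {half_idle:.2f}")
```

Multiply the per-drive figure by the drive count and compare it with the purchase price to see which term dominates for a given setup.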

Their use case is mostly write-once: they fill the data and never delete. The write rate is probably limited more by the upload speeds of their users than by the disk bandwidth and the multitude of port multipliers that they use. Recovery is anyway mostly about copying all the user data to an external HDD and ship that which is a lot less performance critical as shipping the HDD will take a lot longer than reading all the data from across their systems.

As for saving power by spinning down disks, it is likely to be useful to them and is completely possible even in SW RAID though it requires some managing to perform effectively.

There isn't much that is applicable directly to most other use-cases but if your data is mostly sitting idle and you only need occasional access to it the backblaze pod is a nice design. If you care about performance and do not deploy multiple pods with redundancy between them you are not likely to be happy with the result.

> Recovery is anyway mostly about copying all the user data to an external HDD and ship that which is a lot less performance critical as shipping the HDD will take a lot longer than reading all the data from across their systems.

I've restored just a few files from Backblaze. While it's an "offline" operation where you choose the file, then get a notification when it's ready to be downloaded, it took only a handful of minutes.

It's not why I signed up with them, but it was delightful that it worked.

Actually, Glacier is probably backed mostly by tape with a disk cache. Writing then becomes cheap, as you can grab the next blank tape, but restores take a while, as they have to pull the specific tapes needed from whatever storage system they're using.
Seriously doubt Amazon is using tape.
No one outside Amazon truly knows what Glacier runs on. It is likely a combination, and tape may play a role. That would explain why it's relatively inexpensive to house the data, while the costs to get it back are very high and meant for "emergency, everything else has failed" situations.
Asked what IT equipment Glacier uses, Amazon told ZDNet it does not run on tape. "Essentially you can see this as a replacement for tape," a company spokesman said via email.

http://www.zdnet.com/amazon-launches-glacier-cloud-storage-h...

See also: http://en.wikipedia.org/wiki/Amazon_Glacier#Storage