by andy4blaze 2131 days ago
Andy at Backblaze here. To put a pin in it, the three main factors for which drives we use are cost, availability and reliability. We have control over reliability as our systems are designed to deal with drive failure. That leaves the market to decide on cost and availability. Assuming a competitive market we can buy the drives that optimize those factors.
5 comments

Hey Andy, I've got a question: I went to analyze your raw data dumps and saw a couple of things that I wanted to understand:

* A hard drive may go missing in the data from one day to the next without appearing as "failed" -- does this mean you've taken these hard drives offline for some reason?

* A hard drive may appear as failed one day, and then appear the next day with a failed=0 flag. Does this mean that these drives have been serviced and returned to the front lines?
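
Both patterns can be searched for mechanically. The following is a minimal sketch, not an official tool: it assumes each daily row has already been reduced to a (date, serial_number, failure) tuple, mirroring the date, serial_number and failure columns of the published CSVs. The function names and the sample rows are made up for illustration.

```python
from collections import defaultdict
from datetime import date


def group_by_serial(records):
    """Map serial -> list of (day, failure) rows sorted by day."""
    history = defaultdict(list)
    for day, serial, failure in records:
        history[serial].append((day, failure))
    for rows in history.values():
        rows.sort()
    return history


def vanished_without_failure(records):
    """Serials whose last row is before the final day of the data and has failure == 0."""
    last_day = max(day for day, _, _ in records)
    history = group_by_serial(records)
    return sorted(
        serial
        for serial, rows in history.items()
        if rows[-1][0] < last_day and rows[-1][1] == 0
    )


def reappeared_after_failure(records):
    """Serials with a failure == 1 row followed by a later failure == 0 row."""
    history = group_by_serial(records)
    result = []
    for serial, rows in history.items():
        seen_failure = False
        for _, failure in rows:
            if failure == 1:
                seen_failure = True
            elif seen_failure:
                result.append(serial)
                break
    return sorted(result)


# Made-up sample rows, not real Backblaze data.
SAMPLE = [
    (date(2020, 1, 1), "A", 0), (date(2020, 1, 2), "A", 0), (date(2020, 1, 3), "A", 0),
    (date(2020, 1, 1), "B", 0),  # gone after day 1, never flagged as failed
    (date(2020, 1, 1), "C", 0), (date(2020, 1, 2), "C", 1), (date(2020, 1, 3), "C", 0),
]
```

On the sample rows, drive "B" is reported as vanished and drive "C" as reappearing after a failure. Neither function says why a drive behaved that way; that part still needs an answer from Backblaze.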

Hey Andy, many thanks to Backblaze for putting this together. Your stats are the first thing I looked at when looking for new NAS drives.

Since you mentioned cost, do you have data on average price per GB, both acquisition and electricity cost / year? With how many drives you guys bought there should be enough datapoints for a pretty neat graph.

The average cost per GB is about $0.02/GB or $20/TB for the drive. Electricity varies based on the data center and the negotiated or local rates, so harder to calc. I like the idea of the chart, but may be tough to get the right data.
My main concern with third-party backups is the privacy aspect. I never want to worry that some silent TOS change means that ad companies are scanning all my documents. Does Backblaze have any options for e2e encryption or some sort of ironclad privacy policy without a “we can change it at any time” clause?
If your 3rd party storage provider plays any role in your encryption strategy then I would suggest that you’re probably doing it wrong. Data should be encrypted before it leaves your world and not decrypted until it comes home. That way, I don’t care a whit what they do to my data so long as it’s there when I ask for it back.
Several backup tools have encryption in place so that your data can be encrypted before it leaves your device. Rclone for example has encryption and Backblaze B2 capability.
I looked at rclone once and didn’t find any information about encryption. Thanks for mentioning it.
Crypt will encrypt all files before they leave your device, yes.
To the non-rclone users, this refers to one of the targets of rclone; the crypt backend wraps the real backend by chaining them together, e.g. localdata <=> crypt <=> providerstorage (basically like a bi-directional filter). https://rclone.org/crypt/

Edit: I use rclone as the backend for duplicity, so you can also chain it through another tool with different encryption and use rclone as just the transfer engine, getting all the benefits of rclone's providers with the benefits of duplicity's backup strategies.

I use HashBackup[0] (I think the author is on HN) and it has a B2 option. It encrypts your data by default, and you can set up an intermediate backup (like an external HDD) to sit between your live system and B2 so you have multiple layers of backups.

If you're comfortable setting up a cron job, it's a great fit. I use it to back up a 1.5TB Samba directory and wind up paying about $5/m for B2.

[0]: http://www.hashbackup.com/

> My main concern with third party backups is the privacy aspect.

You want tarsnap.[1]

Edit to add: Colin Percival is the author of scrypt[2] and has worked extensively with FreeBSD's portsnap, so he knows what he's doing.

[1] http://www.tarsnap.com/

[2] https://en.wikipedia.org/wiki/Scrypt

> has worked extensively with FreeBSD's portsnap

To be more precise, I wrote FreeBSD's portsnap. (Also, freebsd-update.)

They have an option in software to enable encryption. You provide a key and supposedly the encryption happens at the client on your end. Obviously you have to trust their software and terms like you mentioned, though Backblaze has a great track record and is very open. If you don't want to trust them, there is other software you could use to encrypt your data before giving it to Backblaze.

But one of the benefits of backblaze is the simplicity. Simplicity of setup, backups, and restores. If you muddle with that by encrypting before giving to backblaze you lose out on part of the value.

It also usually seems when people roll their own it opens up risk of forgetting something. BB is easy, set up their encryption and you should be fine.

Just some thoughts.

> They have an option in software to enable encryption. You provide a key and supposedly the encryption happens at the client in your end.

For restore, though, decryption happens at the server end. You have to supply your key to their server, which decrypts the data at their end, then sends you the subset you are interested in restoring.

See [1].

[1] https://www.backblaze.com/backup-encryption.html

Is there a reason they don't just supply you with the encrypted data and give access to a tool to decrypt it?

This is the only thing putting me off Backblaze.

That is strange. The encryption, in that case, only really offers protection against data breaches.
I've been using Backblaze B2 + restic for automated backups of my working documents and photos. restic backups are encrypted by default and Backblaze B2 is one of the supported backends.
> We have control over reliability as our systems are designed to deal with drive failure.

That surely assumes an upper limit of the likelihood of a drive failure. There was a perception that the quality of 3.5" floppy disks declined drastically in the early 21st century. Must we not fear something similar for spinning-rust hard drives once most everyone uses SSDs?

Briefly, any drive (floppy disk or tape drive) has some likelihood of failure. You can minimize loss of data (the reliability being discussed) by replicating data in more than one storage item. It just becomes a matter of how many you buy (and how good you are at keeping them all properly organized).
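
As a rough illustration of the "how many you buy" point: if each copy is lost independently with probability p over some period, all n copies are lost with probability p^n. This is a sketch under that independence assumption only; the function names and numbers are illustrative, and correlated failures (a bad batch, a firmware bug) are exactly the cases it does not cover.

```python
def loss_probability(p_single, copies):
    """Probability that every one of `copies` independent copies is lost."""
    return p_single ** copies


def copies_needed(p_single, target):
    """Smallest number of independent copies whose joint loss probability is <= target."""
    if not (0 < p_single < 1 and 0 < target < 1):
        raise ValueError("probabilities must be strictly between 0 and 1")
    copies = 1
    while loss_probability(p_single, copies) > target:
        copies += 1
    return copies
```

With a 2% chance of losing any one copy, `copies_needed(0.02, 1e-6)` comes out at 4. If per-copy loss rose to 50%, the same one-in-a-million target would take 20 copies, which shows how quickly the required count grows when per-drive quality drops.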
Today reliability is sufficient that one can meet a given data availability goal by replicating the data 2, 3, 4, 5, <whatever> times as there is only once in a blue moon a bad batch of drives when they tried out a new bearing lubricant or so. But what if the economic incentives decline and the market breaks apart (as it arguably does), much as happened for floppy disks once they were (perceived as) obsolete and used only in fringe applications (HP logic analyzers come to mind, but also Boing airplanes)? Is there not the danger that the quality drops drastically to the point that one would need an unreasonable number of copies?
> Today reliability is sufficient that one can meet a given data availability goal by replicating the data 2, 3, 4, 5, <whatever> times as there is only once in a blue moon a bad batch of drives when they tried out a new bearing lubricant or so.

Unless you have an uptime bug in the firmware where all your drives die at once:

* https://www.zdnet.com/article/hpe-tells-users-to-patch-ssds-...

It's a factor of how quickly they can replace drives and how well redundant data is spread between disparate systems. IIRC, they make sure data is dispersed not only at the chunk and drive level, but the system and rack level (and maybe datacenter level? not sure).

At that point, if there's no contingency redundancy built in (see below), it's really a matter of how long it takes to replace a drive (identifying the problem, physically replacing the hardware, and replicating data to it). There's a lot of (fairly simple) math involved in running down those numbers, but based on the percentage of drives that fail in a quarter, I think it would take a spectacular run of bad luck combined with negligence on their part in keeping redundancy levels up over a longer period to actually have problems.
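
The math in question might look something like the sketch below. It assumes failures are independent and arrive at a constant rate, converts a quarterly failure fraction into a per-hour rate, and asks how likely it is that every remaining copy of a chunk also fails before the replacement is back in service. The function names and any numbers fed into it are placeholders, not Backblaze figures.

```python
import math

HOURS_PER_QUARTER = 91 * 24  # approximation: 13 weeks


def failure_prob_in_window(quarterly_failure_fraction, window_hours):
    """Probability that one drive fails within window_hours, assuming a constant hazard rate."""
    # Solve 1 - exp(-rate * T) = fraction for the hourly rate.
    rate_per_hour = -math.log(1.0 - quarterly_failure_fraction) / HOURS_PER_QUARTER
    return 1.0 - math.exp(-rate_per_hour * window_hours)


def prob_remaining_copies_fail(quarterly_failure_fraction, window_hours, remaining_copies):
    """Probability that all remaining copies fail inside the replacement window."""
    p = failure_prob_in_window(quarterly_failure_fraction, window_hours)
    return p ** remaining_copies
```

For example, with a hypothetical 1% of drives failing per quarter and a 24-hour replacement window, `prob_remaining_copies_fail(0.01, 24, 2)` is on the order of 1e-8 per incident. Shortening the window shrinks it further, which is why replacement speed matters as much as the raw failure rate.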

> Is there not the danger that the quality drops drastically to the point that one would need an unreasonable number of copies?

I think the very simple way to look at this is that spare capacity and automatic redundancy checking can account for a lot of bad drives. E.g. if a drive has 100 chunks of data, all copied to 100-200 other drives and systems such that there are three copies of any chunk, and that drive dies, the system detects that those 100 chunks now exist in only two places. It can immediately locate 100 locations that have capacity to receive a chunk and start replicating data to restore the level of redundancy they need. Even if there was a very large set of bad drives, they would have to all go bad in a very short time frame, short enough that they couldn't be physically swapped out and data couldn't be copied across the network, for it to cause a problem.

At least that's how a system like this could be developed, and my understanding is that Backblaze's system works like this to some degree.
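
The re-replication idea described above can be sketched as a toy model. This is only an illustration of the general technique, not Backblaze's actual design: the `Cluster` class, its method names and the "emptiest drive first" placement rule are all assumptions made for the example.

```python
from collections import defaultdict


class Cluster:
    """Toy model: chunks are replicated across drives with limited capacity."""

    def __init__(self, capacities, replicas=3):
        self.capacity = dict(capacities)    # drive -> max chunks it can hold
        self.chunks_on = defaultdict(set)   # drive -> chunk ids stored there
        self.locations = defaultdict(set)   # chunk id -> drives holding it
        self.replicas = replicas

    def _free_drives(self, chunk):
        """Drives with spare capacity that do not already hold the chunk, emptiest first."""
        candidates = [
            d for d in self.capacity
            if d not in self.locations[chunk]
            and len(self.chunks_on[d]) < self.capacity[d]
        ]
        return sorted(candidates, key=lambda d: (len(self.chunks_on[d]), d))

    def repair(self, chunk):
        """Copy the chunk to new drives until it is back at full redundancy."""
        while len(self.locations[chunk]) < self.replicas:
            free = self._free_drives(chunk)
            if not free:
                raise RuntimeError(f"no capacity left to re-replicate {chunk}")
            self.chunks_on[free[0]].add(chunk)
            self.locations[chunk].add(free[0])

    def store(self, chunk):
        """Write a new chunk to `replicas` distinct drives."""
        self.repair(chunk)

    def fail_drive(self, drive):
        """Remove a dead drive and re-replicate every chunk it held."""
        lost = self.chunks_on.pop(drive, set())
        del self.capacity[drive]
        for chunk in lost:
            self.locations[chunk].discard(drive)
        for chunk in sorted(lost):
            self.repair(chunk)
        return sorted(lost)
```

Usage would be along the lines of building a `Cluster` from a drive-to-capacity mapping, calling `store` for each chunk, and then calling `fail_drive` to see which chunks had to be copied elsewhere. The `RuntimeError` path is the interesting failure case: it only triggers when drives die faster than spare capacity allows, which matches the argument that trouble requires many failures in a very short window.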

> (HP logic analyzers come to mind, but also Boing airplanes)

Typo of the year, "Boing 737MAX" sounds more like a basketball than something I'd want to fly in.

I would like to thank you for putting numbers here too!