Hacker News
DigitalOcean lost our data and gave us $500 (dfernandez.me)
55 points by danielfernandez 4615 days ago
28 comments

So if you were backing up your data to Tarsnap, then you'd be up and running as quickly as you could launch a new instance and redownload everything. And $500 credit is enough to power a micro droplet for 100 months, or a small droplet for 50 months. DO handled this well.

http://www.tarsnap.com

EDIT: s/years/months/g. Thanks.

You mean months.
50 months == 4 years and 2 months
100 months == 8 years and 4 months
So this is a technical problem I am having right now that's preventing me from backing up a Postgres database completely (hope someone here can help).

I have a master Postgres database that is receiving a TON of transactions per second (I'm talking about thousands of concurrent transactions). We tried running pg_dump on this database, but the DB is just too huge, and it took more than 4 days to dump everything. Not only that, but it impacted performance to the point where backing it up this way was just not feasible.

No problem.. just create a slave-DB and run pg_dump on that, right? We did just that, but the problem is that you can't run long running queries on a hot standby (queries that take more than a minute).

What would you do in my scenario? With the hot standby, I technically am backing up my data, but I would have 100% peace of mind if I could take daily backups in case someone accidentally ran a "DROP DATABASE X", which would delete the database on the hot standby/slave as well.

There's a setting in postgresql.conf that will let you up the limit on long running queries on the standby from 30 seconds to ~ unlimited.

http://www.postgresql.org/docs/9.0/static/runtime-config-wal... See max_standby_archive_delay and max_standby_streaming_delay, -1 lets them wait forever.

Alternately, you can issue pg_start_backup('label'), backup the filesystem, then issue pg_stop_backup() and keep all the WAL logs from that time. That'll get you a base backup similar to the slave.
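A minimal sketch of that start/copy/stop sequence, assuming psql and GNU tar are on the PATH and that psql picks up its connection settings from the environment. The function name, the label, and the paths are hypothetical; only pg_start_backup() and pg_stop_backup() come from the comment above. The WAL segments written between the two calls still have to be archived separately.

```python
import subprocess


def base_backup(label, data_dir, archive_path, run=subprocess.call):
    """Tar the data directory between pg_start_backup() and pg_stop_backup().

    `run` takes an argv list and returns an exit code; it is a parameter so
    the sequence can be exercised without a real server.
    """
    def must(argv, ok=(0,)):
        code = run(argv)
        if code not in ok:
            raise RuntimeError(f"{argv[0]} exited with status {code}")

    # Assumption: label is a plain string with no quote characters in it.
    must(["psql", "-c", f"SELECT pg_start_backup('{label}');"])
    try:
        # GNU tar exits with 1 when a file changed while it was being read,
        # which is expected on a live server, so 1 is accepted here.
        must(["tar", "-czf", archive_path, "-C", data_dir, "."], ok=(0, 1))
    finally:
        # Always leave backup mode, even if the copy failed.
        must(["psql", "-c", "SELECT pg_stop_backup();"])
```

Something like base_backup("nightly", "/var/lib/postgresql/data", "/backups/base.tar.gz") would run the three steps in order; the try/finally is there so a failed copy does not leave the server stuck in backup mode.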

What I'm doing is this:

I've got a primary/hot spare pair, and a tertiary db on lesser equipment that's my second copy for cases where I have one of the main machines down or I have to rebuild the secondary from the primary.

The tertiary db ships logs to s3, after gpging them. Every $timeframe, I take a base backup and throw it up as well. I keep a couple, and delete the older ones. Every few months, I test a restore on ec2. There's a balance between the WAL logs that you need to keep, the time to restore, and the frequency of base backups.

[edit - parameter names. Further edit - strategy.]

I use postgres and ran into this issue as well.

Inside postgresql.conf for the slave I have the following:

# These settings are ignored on a master server.

hot_standby = on                        # "on" allows queries during recovery
                                        # (change requires restart)
max_standby_archive_delay = 900s        # max delay before canceling queries
                                        # when reading WAL from archive;
                                        # -1 allows indefinite delay
max_standby_streaming_delay = 900s      # max delay before canceling queries
                                        # when reading streaming WAL;
                                        # -1 allows indefinite delay
#wal_receiver_status_interval = 10s     # send replies at least this often
                                        # 0 disables
#hot_standby_feedback = off             # send info from standby to prevent
                                        # query conflicts

So I set it to 15 minutes for this specific backup server which I am okay with. I already have another server with much shorter time delays.

So you basically sacrifice speed of replication in order to ensure long running queries don't get cancelled?
It's sacrificing the expected latency of replication.

Incidentally, if you're on 9.3 and your HW can handle it, take a look at parallelizing the pg_dump. If you've got a relatively fast disk subsystem and many cores, you can get a speedup. I've found it tends to make the dumps O(biggest table) instead of O(sum of all tables).

(It's native on 9.3, I've hacked up some scripts that do it for 9.0, but they don't get a consistent snapshot, so I do it during scheduled downtime. OTOH, the dump/restore is ~6x faster OMM/OMD, so the downtime is that much shorter)
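For the native 9.3 route, a rough sketch of building the parallel dump command. pg_dump's --jobs option only works with the directory output format; the helper name, the database name, and the output path below are made up for illustration, and the job count is just a guess based on core count.

```python
import os


def parallel_dump_cmd(dbname, out_dir, jobs=None):
    """Build a pg_dump argv that dumps tables in parallel (pg_dump 9.3+)."""
    if jobs is None:
        # Assumption: one worker per core is a sensible starting point.
        jobs = os.cpu_count() or 1
    return [
        "pg_dump",
        "--format=directory",   # parallel dumps require the directory format
        f"--jobs={jobs}",
        f"--file={out_dir}",
        dbname,
    ]
```

The resulting list can be handed to subprocess.check_call(). Each worker opens its own connection, so the server needs enough spare connections and I/O headroom for the chosen job count.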

Was the database designed using transactions to achieve consistency? If so, then you can just instruct Tarsnap to back up the folder containing your database every day, and you're done.

If the DB uses transactions for consistency, you can copy it at any time without any problems.

As long as you issue the pg_start_backup/pg_stop_backup pair and keep the WAL logs. If you don't, then you've got a corrupt backup.

At least you would catch that problem in your first test restore.

... what? The point of consistency is that if the power to your server is cut, then you can reboot and pick up precisely where you left off. That means the database on disk must have consistency. Meaning you should be able to copy it at any point in time without any problems. If you can't, then that's not consistency, and if postgres really works that way, then it's failing one of the basic tenets of being a database. http://en.wikipedia.org/wiki/ACID

Any database that purports to have consistency must be able to withstand cutting the power to the server at any time. And if it can do that, then it must be true that you can copy the database folder at any time, too, without any special commands. (pg_start_backup is not issued before every power loss, so why would it need to be issued before a copy?)

On the other hand, if postgres doesn't support consistency, then that'd be a major reason not to use it.

EDIT: I'd run the server in a VM and backup VM snapshots. VMware makes this painless (and the snapshotting process is designed to have minimal impact on disk I/O performance for precisely the scenario the OP described). VirtualBox probably has something similar. These replies seem crazily overcomplicated in comparison.

Unless tarsnap does something like LVM snapshotting, then it's not going to get a consistent snapshot. You can't just copy the directory of an active server.

See: http://www.postgresql.org/docs/9.0/static/continuous-archivi... section 24.3.2. Making a Base Backup.

If you have something like LVM or ZFS doing snapshots, then you can just tar the data directory.

Re: VMs

Leaving aside the management issues of huge vm images and the less than ideal io performance, the ACID guarantees of pg rely on the underlying hardware obeying some specific restrictions, including real fsync and not lying about when things are on permanent storage. Getting the drives and raid controllers to obey that has historically been a difficult, ongoing job that has to be redone with each new generation of hardware. SSDs have been particularly interesting with that, the actual flush to disk can be quite delayed from the logical write. Some have supercaps, some don't. Those that don't are vulnerable to power losses while the data is still in the drive's ram awaiting a block erase and write. The IDE drivers used to flat out lie. Enterprise SAS drives often come with the write caching turned on (since it looks better in benchmarks) even though they're often times used behind a battery backed raid controller.

Adding a VM layer to that just to get snapshots seems overly complicated and prone to issues.

If you're taking an instantaneous snapshot of the system then yes. A standard copy/rsync/etc. isn't going to give you that. If the copy takes a long time at what point do you grab the pg_xlog directory? and are all the files there that you need/ed?
ACID doesn't apply since you can't copy a large file in an instant. The copy takes time, in which time the files on disk can change. This isn't the same thing as the server losing power.
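A toy illustration of that point, not a model of how postgres actually lays out its files: two "pages" are copied one after the other, and a committed transaction lands in between. All names here are invented for the example.

```python
def copy_file_by_file(live, between_files):
    """Copy a dict of 'files' one at a time, the way cp or rsync would.

    between_files() is called after the first file has been copied, standing
    in for writes that land while a long copy is still running.
    """
    backup = {}
    for index, name in enumerate(sorted(live)):
        backup[name] = live[name]
        if index == 0:
            between_files()
    return backup


live = {"page_1": 100, "page_2": 0}


def transfer():
    # One committed transaction: move 50 from page_1 to page_2.
    live["page_1"] -= 50
    live["page_2"] += 50


backup = copy_file_by_file(live, transfer)
print("live total:", sum(live.values()))
print("backup total:", sum(backup.values()))
```

The live data always sums to 100, but the copy holds page_1 from before the transfer and page_2 from after it, a state that never existed at any single instant. A power cut freezes every file at the same moment; a slow copy does not, which is why the copy needs either a snapshot or the WAL covering the whole copy window.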

VM snapshots, zfs snapshots, etc are the way to go.

There's something that works and there's the right way to do it. It's better to do things the right way if you want to make sure everything is in a good state when you bring it back and there aren't edge cases you missed ... What if someone forgot to use a transaction?
The WAL is used for recovering from power loss. You need both the db files and your WAL to get a backup.
Does this interfere with the replication process at all if I run pg_start_backup and pg_stop_backup, and rsync the files to another server?
Nope. I do it all the time.

The start/stop backup has to be issued on the master. It doesn't look like the standby gets the backup label (at least on 9.0, may have changed since). So you'd have to be reading from the master's data directory.

Alternately, you could stop the secondary and pull from there. But that interrupts the replication, and then the secondary would have to catch up, which might be hard depending on your level of usage.

feel free to email or chat my un on freenode.

What filesystem are you running on? Can you snapshot it outside of the postgres environment? The database may be mid-transaction at that point, but it's still better if it does log replay at startup, than losing all the data.

Also if your filesystem snapshots can be exposed as files / block devs, you can rsync them to another host lowering the amount of transferred data (keep the previous copy so rsync will only copy the blocks that differ).

Just a thought... If your storage layer has support for taking a consistent snapshot of your file system then you might be able to use this to get a backup.

You would get a copy of your database that you would need to run log-replay recovery on but after that it should be all good.

pg_dump is a logical backup; that is, as you've seen, it queries all of the data in your database and writes it to a file in the form of queries that will re-create all of your data in a new database. This is great and very flexible, but as you've seen, it has some limitations.

You probably want to look into physical backups, where you basically copy the actual files that postgres is actually using to store your databases on disk (although it's not quite that simple, so do some googling on it). This has the nice advantage of not requiring you to run queries against your database to back it up. It also gives you a consistent point-in-time backup of your database.

Sounds to me like you have no choice but to shard your database in order to reduce the write load on a single database.

Or upgrade your database servers' hardware (more RAM, faster CPU, faster SSD) to the point where you can dump the database.

By the way if you need help feel free to e-mail/ping me.
The abrasive headline is kind of unfortunate, as the actual moral of the story given at the end is exactly the right takeaway: Never assume your hardware is infallible, so always have backups that you know you can use when your server experiences a wildly improbable catastrophe.

Also, very impressed by Digital Ocean's response here. Given their reputation as a budget host, they really do put a lot of effort into service.

> wildly improbable catastrophe

Or an extremely probable one like a hard disk failure. They only last a few years; most data centers see an annual replacement rate in the 2-13% range. The failure rate is a known quantity, and their limited 1-3 year warranties reflect that expectation.

There isn't a host I've used more than a few years where I haven't seen hard drives (and power supplies) fail. I don't know if my experience is typical, but hardware RAID controllers seem to go bad on me not-infrequently too, losing the whole array at once. They don't pay you when it happens, they just replace it. DO was extremely generous here.

Was going to say the same thing, Dual drive failure on a RAID5 system with five 2TB drives is 1 in 12. With 3TB drives that goes up to 1 in 7.

The underlying issue is that the uncorrectable read error rate is 1 in 10^15 bits; this is just physics (thermal noise, read/write signal loss, etc.). But with 8b/10b encoding that is only 90TB worth of bits. Rebuilding a RAID group of 5 with four 2TB "good" drives (8TB of data to be read), you will see a failure in one of the other 4 drives 1 in 11.25 times (90/8). With 3TB drives, 1 in 7.5 times (90/12). Using simple mirroring, you won't be able to re-silver a mirror 1 in 45 times (90/2), or slightly more than 2% of the time, for 2TB drives.
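The division behind those odds can be written out as a small sketch. It takes the 90TB-per-expected-error figure from the comment as given and uses the same simple linear estimate (data to read divided into 90TB); the function name is invented here. By this arithmetic the 3TB case is 90/12 = 7.5.

```python
# Figure taken from the comment above: roughly 90 TB of payload read per
# expected uncorrectable read error (1 in 10^15 bits, 8b/10b encoded).
TB_PER_EXPECTED_ERROR = 90.0


def rebuild_failure_odds(surviving_drives, drive_tb):
    """Return N such that a rebuild hits a read error about 1 in N times."""
    tb_to_read = surviving_drives * drive_tb
    return TB_PER_EXPECTED_ERROR / tb_to_read


for label, drives, size_tb in [
    ("RAID5, five 2TB drives", 4, 2),
    ("RAID5, five 3TB drives", 4, 3),
    ("mirror, 2TB drives", 1, 2),
]:
    print(f"{label}: 1 in {rebuild_failure_odds(drives, size_tb):g}")
```

This is only the back-of-envelope version; a stricter treatment would compute the probability of at least one error across all bits read, which gives somewhat lower failure odds at these sizes.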

Dual parity, or triple mirrors (x3) are now the minimum bars for making storage reliable.

Well it's just a bit unlucky to have both drives fail in a RAID (although not impossible).
That's way more compensation than I would have expected. AWS usually won't even notify you until after the node has gone down.

Hardware failures happen; an application needs to be tolerant of it.

And with S3 storage so cheap, they should be backing up directly to S3, across multiple regions.
This is 2013. Why are we still talking about backups as a lesson learned? Is it because startups are skimping on Sys Admins?
It's because some startups have developers that open w3schools, start typing examples, and somehow ship a quasi-working proof-of-concept that goes into production.

There's a bit of "if it ain't broke don't fix it" here, but a whole lot of "get with the program" still required.

Well as a professional Systems Administrator, it pisses me off more than it probably should. It's like you want to know why I'm worth what I'm asking because when your shit falls down and goes boom, I'll get you back up and operational in minutes or an hour.

Because it's my fucking job to help you manage your IT risks. Azure, Heroku, AWS aren't replacements for Systems Administration, they're just tools in my arsenal. I don't understand the mentality it takes to go into business (beta or not) without having SOME understanding of your risk. The fact that DO paid you a not insignificant amount due to downtime, means you're damn lucky.

Do you know of anyone who didn't get deadly serious about backups before they had a sour taste of data loss?

Me, I was just lucky my first really interesting experience was on a big UNIX(TM) Version 6 system, with a couple of user accessible DECTapes. Buying a tape was cheap enough, and the whole thing was neat ... and then I learned the -rf flags to rm. And had any critical data I lost on that DECTape.

Today I do nightly backups of my home systems to LTO-4 tapes (as well as offsite of the most critical to rsync.net a time zone away).

Yes, of course. A full fledged sysadmin is expensive, and startups will typically make several costly mistakes before going to that expense.

This is not surprising, and is not even regrettable. If the business can't support the overhead of someone who doesn't directly bring in revenue, then it can't. And if there's a large investment that makes good infrastructure engineering possible, first-time entrepreneurs might not realize that they need that function.

The key to long term success is in realizing what you will need before it's too late to get it.

It's great you had backups, but why a write-up? Is it an attempt to smear DO's otherwise good name? It's an un-managed VPS so it's your responsibility to keep backups of your box, not theirs. And hardware fails all the time, so you can expect this to happen anywhere.
> And if you just launched and have a single instance running, let your alpha users know that there will probably be some downtime.

That's true. But there's no reason for extended downtime even if that instance goes down. Make sure your whole setup is described in chef/puppet/salt/ansible/cf/whatever and even a rebuild from scratch takes only minutes then. There's really little reason to skip that these days.

DO is affordable enough that the minimum you should run is 2 droplets. Having said that, I'm actually fairly impressed with the $500 credit, and now you have no excuse not to run 2 VMs. Consider it a lesson learnt.
Alternate title: DigitalOcean went above and beyond their SLA for us.
DigitalOcean's pricing page indicates that "All cloud hosting plans include automated backups". (https://www.digitalocean.com/pricing) From the email you received, it sounds like this is clearly not the case. I wonder what other claims DigitalOcean is making that are not true.
There is an automated backup system that you have to enable for a droplet, that creates a snapshot every few days. It's a clear part of a droplet's control panel. They began charging for it in July 2013. The price is 20% of the droplet's monthly cost. Sounds like they need to update their pricing page.

This is pertaining to a droplet feature though, and not some low-level backup system. Meaning, it's not as if they're lying about the infrastructure below what a normal customer can see. They just have an erroneous pricing page.

That's always a risk with servers. They can die and everything can go with them. But they had backups so they didn't lose everything.
That seems very nice of DO. I would not expect them to be responsible for data loss in case of hardware failure.
This might sound a bit glib, but raid 5 shouldn't really be used in modern storage.

If you ignore the performance issues (which can vary by device), it's just not safe. Depending on the size of the drive, it can take anywhere up to 30+ hours to rebuild.

Bear in mind that you tend to use disks that are all from the same batch, so it leaves you in the danger zone for far too long.

Your options are:
some sort of clever RAID (ZFS type thing)
another type of clever RAID (like the LSI chunk thingy in the DCS37000)
RAID 10

For SSDs, where the time-to-read/write-full-capacity is typically much less than HDDs (both due to higher speed & lower capacity), it can be less of a poor decision. SSDs also have somewhat more advanced machinery for data integrity checking and slightly friendlier failure modes (e.g., the sectors "wear out" over time, but the firmware tends to warn you as that starts to happen, and you're not going to hit a sudden mechanical failure).
Was this really a dual drive failure, or was this the rather common single drive failure plus undetected errors on a backup drive, that show up when trying to rebuild?

Because that happens a lot, and it's very important to do a full read of every drive in the array at least weekly! You have two options for doing that:

If you are using Linux md raid, then run the "check" command, which automatically does the test using background I/O (but does still impact things). On Debian, and perhaps other distros too, the mdadm command will do it every month by default. Make sure to set a minimum speed or it might never finish if you have a busy system.

You can also use the built in SMART on the disk to do a long self test. This also uses background I/O and I think it has a bit less impact on existing operations. (But you have to have some idle time on the disk or it will never finish.) If you install smartmontools you can set smartd to do this test for you every week, and keep an eye on the results.

I personally do both, plus a short self test of the disk every night.
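For the md side, a hedged sketch of kicking off a check by hand through the kernel's sysfs and procfs files. The array name md0 and the 5000 KB/s floor are assumptions, the function names are invented, and it needs root on a real system. The root directories are parameters only so the logic can be tried against a scratch directory.

```python
from pathlib import Path


def start_md_check(array="md0", min_speed_kb=None, sys_root="/sys", proc_root="/proc"):
    """Ask the md driver to read and verify every block of the array."""
    if min_speed_kb is not None:
        # Floor for check/resync throughput in KB/s per device, so the
        # check still makes progress on a busy system.
        speed_file = Path(proc_root, "sys/dev/raid/speed_limit_min")
        speed_file.write_text(f"{min_speed_kb}\n")
    Path(sys_root, "block", array, "md/sync_action").write_text("check\n")


def mismatch_count(array="md0", sys_root="/sys"):
    """Read the mismatch counter reported for the most recent check."""
    return int(Path(sys_root, "block", array, "md/mismatch_cnt").read_text())
```

Calling start_md_check("md0", min_speed_kb=5000) starts the scan in the background; progress shows up in /proc/mdstat, and mismatch_count() can be read once it finishes.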

I truly believe that we did the best we could in this instance. Drive failures are always unfortunate; even with backups, downtime exists.

That being said, we're always genuinely looking to improve, and I'd welcome your feedback on how you feel we did and how you feel we could do better. Please do reach out to me personally john@do! Thanks. :)

"we had backups".

Do you mean you had backups on digital ocean (using their backup service) or something else?

Me too. As far as I know, the DO backup is saved on Amazon, no?
I'm wondering the same thing
Good thing you had backups.

With that being said, these days it's a good idea to use a deployment tool or configuration management system like puppet/salt/ansible/chef/etc, especially in a virtualized environment. This will help with scalability as well as situations such as these.

This is the reason why I moved all data away from my server instances. My images are hosted by cloudinary(with s3 bucket backup) and my databases are Amazon RDS instances. I don't care if a server goes down, I can launch a new one in a matter of minutes (with ansible) without any data loss.
Which of those things you named is protecting you from losing your database? I paid the uber-high fees for RDS with Multi-AZ failover and... well... it failed, then didn't fail over to another AZ. The instance ended up down for hours before they recovered it. That's when I jumped ship from AWS, wrote off the reserved instance payments, moved the database to some rented servers at SoftLayer, and handle nightly off-site backups myself. Not only do I have working backups and failover, but 4-8x the capacity per dollar.
The author is sweet; his conclusion was "always backup your data". If it was me, I would probably say "I'm moving away, will never trust them again with my data".
Is there a provider that credibly offers high availability Linux servers? Disks fail, capacitors fail, power fails, network equipment fails (a lot). I'm sure it's possible to build an ultra-reliable server that mitigates all that but I doubt it would be worth the money.
Hardware fails, there's no way around it.

At $5/month, I think it's not too much of an investment to have some basic redundancy if you care about your data. Anyway, if your data matters to you, do backups.

To whom would you migrate? It seems to me that you wouldn't get better service without a managed server.
The $500 credit from DO is quite reassuring. Usually if the HD fails and your data is lost, you're out of luck. I hear the "horror" stories of some hosts reusing consumer hard drives between servers, so I've learned: your data is your responsibility. I'm glad the OP had backups, but these failures happen; thankfully DO had the business sense to compensate them.

Seems like good advertising for DO, as any knowledgeable system admin knows drives fail. DO could have done nothing.

Linkbait title, they handled it exceedingly well. Onus is on you to back up your data. You did not 'lose' your data, given you had backups.
> And if you just launched and have a single instance running, let your alpha users know that there will probably be some downtime.

How about instead "alpha users should know that there will probably be some downtime". Multiple instances don't really fix that.

Nice move from DO to give everyone $500 credit. As I remember, they don't guarantee data safety (you still need backups even if they did). Double disk failure is a rare thing, but it happens.
Is DO apologizing here as a PR move, or do they make reliability claims that would lead you to think this sort of thing wouldn't happen?
DigitalOcean proudly advertises that they use SSDs... A dual drive failure with data loss should be very rare. I wonder what happened.
So they now run raid5? I remember they boasted about raid10 a while ago, now they silently downgraded to raid5 :)
What is the best/most cost-effective way to back up a Windows server?
Backup the data and configuration information to an object store (AWS S3), use configuration management tools so you can programmatically provision a new server (dedicated or virtual, doesn't matter) in the event of failure. Provisioning should include functionality to deploy your application, and to restore your data to whatever data storage application (SQL, NoSQL, etc) you're using.

If you have questions, I'm more than happy to provide free advice.

What are the best options for backing up DO externally?