	by peterwwillis 2709 days ago
RocksDB is a fork of LevelDB, which was [in]famous for how easily it corrupted data. Did Facebook ever do anything to ensure data wouldn't corrupt, or is that still a common thing operationally? (You find it more at larger scales.)

Here's an example of how data corruption can suck, using Riak and LevelDB. The LevelDB data would corrupt often, which would leave you in a predicament. Say you had 10 nodes with a replication factor of 3, and the whole cluster is humming away at a decent clip. Now one node's LevelDB corrupts, and you have to rebuild it. If you have a huge fuckoff dataset, this can take a while. Now another node goes down. For any data replicated on both of those nodes, only 1 node now has the data you need, and 2 nodes are down - so now 8 nodes are doing the work of 10, and if you have any more failures, your data might be gone.

Now add replication, which will suck performance and bandwidth away from the regular work. And because it would corrupt so easily & often, there needed to be hash trees to quickly identify what data was corrupt, and then you needed to fix it and rebuild your hash trees. This would also suck away performance. Finally, you can't just add new nodes while rebuilding, because the extra load makes the cluster fall over. And the more nodes, the higher the likelihood of failures.
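
To put rough numbers on the 10-node, replication-factor-3 example, here is a minimal sketch. It assumes a simplified ring where partition i lives on nodes i, i+1 and i+2; that layout and the function names are illustrative assumptions, not Riak's actual placement logic.

```python
from collections import Counter


def replica_sets(num_nodes, replication_factor):
    """Assumed layout: partition i is stored on nodes i, i+1, ... (wrapping around)."""
    return [
        {(start + offset) % num_nodes for offset in range(replication_factor)}
        for start in range(num_nodes)
    ]


def surviving_replicas(num_nodes, replication_factor, failed):
    """Map 'replicas still alive' -> number of partitions in that state."""
    failed = set(failed)
    return Counter(
        len(nodes - failed)
        for nodes in replica_sets(num_nodes, replication_factor)
    )


# Two adjacent nodes are down: one rebuilding after corruption, one failed.
health = surviving_replicas(10, 3, failed={3, 4})
for alive in (3, 2, 1, 0):
    print(f"{health[alive]} partitions have {alive} live replicas")
```

Under this layout, two adjacent failures leave some partitions with a single live copy, which is the "one more failure and your data might be gone" situation described above.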

4 comments

StreamBright 2709 days ago

I never experienced this with several production Riak clusters running for years. Can you explain how to reproduce it, or give a link to any public forum where this was discussed?

peterwwillis 2709 days ago

Sure. Build about 10 classes of clusters of varying sizes, each with a dataset ranging from 100GB to a petabyte or more. Run them on shitty oversubscribed openstack clusters with a combination of ephemeral, Ceph, and SAN disks. Do replication to similar-ish clusters in different regions. Handle data for about 100 different applications that process so much data at such low latency that cloud-based databases aren't even an option. Keep adding nodes and storage to existing clusters over time.

It turns out that really unstable hardware/networks like to expose bugs. It also wasn't discussed in public forums. We paid for support and even employed Riak developers, and we still hobbled along, putting out fires. I'll bet other DBs go through the same crap and keep it quiet.

Also, read the Riak documentation and you'll find the corruption recovery documentation among other hints at common failures and limitations.

StreamBright 2709 days ago

Thanks for confirming it wasn't a Riak issue. SAN disk, really? As an architect, I can tell you that SANs are almost always an antipattern for building reliable and scalable distributed systems.

peterwwillis 2708 days ago

I didn't know I was being interrogated about Riak failure modes. Ok, here's more verbiage on Riak failures.

First off, SAN was one of three different disk storage solutions. When you work for <BIG CORP>, the lowly product teams don't always get to pick and choose what infrastructure is available. They have to do the best they can with what they have, when they don't get what they ask for. If <BIG CORP> says to use a shitty openstack cluster, that's what you have to deal with, and you have to beg for all the ephemeral SSDs you can get. (Which then becomes a huge pain in the ass when you need to scale storage and your choice is (A) buy more machines and migrate nodes so you can upgrade ephemeral on the old machines, (B) start swapping disks in running hypervisors and cry yourself to sleep, or (C) use SAS or other array/volume on a SAN.)

And thanks for blithely ignoring what I'm saying. Riak did have corruption bugs that should have been preventable - as I said, a major source of the problems was LevelDB, and Riak's own documentation shows this to be true.

You could look sideways at these things and the db would corrupt. A node with ephemeral storage, with no detectable errors on it whatsoever, would suddenly stop working. We would go look at it, and it had a single leveldb file corrupt... and nothing else wrong with it at all. Not only would it corrupt, it wouldn't make any attempt to fix itself, even though there was a documented fix.

Riak has anti-entropy intended to detect missing data and fix it, but it's the Erlang equivalent of cron jobs and hash trees. The whole thing is designed to just go "dum de dum, I wonder if anything's broken after $INTERVAL?" and then perform some operation, which, if the cluster is under load, may kick it over. So they added throttling (throttling is everywhere in Riak, because instead of simply rejecting operations that are unsafe, they'd rather make everything go r e a l l y s l o w).
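
For anyone unfamiliar with the hash-tree idea, here is a minimal sketch of how two replicas can find which key segments disagree without comparing every key. It is not Riak's AAE implementation; the segment count, hashing scheme and function names are all assumptions made for illustration.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def segment_of(key: str, num_segments: int) -> int:
    """Assign a key to a leaf segment by hashing it."""
    return int.from_bytes(_h(key.encode())[:4], "big") % num_segments


def build_tree(store: dict, num_segments: int = 8):
    """Return levels of hashes: levels[0] are leaves, levels[-1] is [root].

    num_segments is assumed to be a power of two.
    """
    buckets = [[] for _ in range(num_segments)]
    for key in sorted(store):
        entry = _h(key.encode()) + _h(store[key].encode())
        buckets[segment_of(key, num_segments)].append(entry)
    level = [_h(b"".join(entries)) for entries in buckets]
    levels = [level]
    while len(level) > 1:
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_segments(a, b):
    """Descend from the root, following only subtrees whose hashes differ."""
    top = len(a) - 1
    suspects = [0] if a[top][0] != b[top][0] else []
    for depth in range(top - 1, -1, -1):
        suspects = [
            child
            for parent in suspects
            for child in (2 * parent, 2 * parent + 1)
            if a[depth][child] != b[depth][child]
        ]
    return suspects


healthy = {f"user:{i}": f"v{i}" for i in range(100)}
corrupt = dict(healthy)
corrupt["user:42"] = "garbage"  # simulate one damaged value on a replica

print(differing_segments(build_tree(healthy), build_tree(corrupt)))
```

Only the keys in the reported segment need to be compared and repaired. The cost is that the tree has to be rebuilt or updated as data changes, which is the extra load complained about above.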

There was very little intelligence or event-driven programming for failure detection and remediation. When the db corrupted in a way that wasn't handled by anti-entropy, the node would just die, and we had to manually intervene (later by writing automation to intervene) rather than it, you know, just doing its own automation to fix the corruption. The AAE trees rebuild every $INTERVAL and there's no way to change when or how they rebuild other than to change the $INTERVAL, so there's no way to, for example, force them to rebuild when it is convenient based on lulls in application use.

Then there's Riak search, which has the nice habit of taking down your cluster due to god knows what (memory bloating, cpu starvation, unknown bugs in error logs, etc). Don't use Riak search.

Replication was also a joke. Any network interruption (hello, distributed apps have network interruptions) would kill replication. We would have to detect that replication had failed and queues were filling up, and restart the replication until the queues drained. But sometimes replication couldn't resume, because one of 1,000 different potential failure modes was happening on 1 node in a remote cluster somewhere. So we had to resolve that node's issue and get the whole remote cluster healed before the replication queues filled up. If we didn't do that, we'd have to do a full transfer to prevent potential data loss, which would take days.

We developed auto-healing scripts to deal with most of these situations, but the controls Riak added to slow down processing so it didn't kill the cluster from all the competing operations it was trying to do at once (kv processing, replication, hash regeneration, etc) were not enough for our automation to be able to efficiently control the nodes when they were unhealthy. Riak would just occasionally perform incredibly poorly, or nodes would die randomly, and we'd get some unknown errors we couldn't diagnose. All our monitoring and investigation showed nothing wrong with the host - no resource starvation, no error messages, no spikes of client traffic. Riak was just having a bad day, and we, being a very small team of not-Erlang-programmers, had to just restart shit until it got better, and research fixes once things improved. Our postmortem incident queue was rather large.

This is a small sampling of production Riak issues. I'm not going to dig into my brain for every bug they have, but suffice it to say that a distributed database should be able to recover from a single file corruption, and should be able to keep it from ever happening through various techniques that are 20+ years old. Their code is just lame, and proof that just because you write something in Erlang doesn't mean it's going to be stable. And in Riak's defense, the reason their code was lame was that they were a small company trying to juggle a lot of demanding engineering issues from different customers, and they didn't have much money or time. But lame code is still lame code.
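
One example of the kind of old, well-known technique meant here is a checksum on every record, so that a damaged record can be detected and re-fetched from another replica instead of taking the node down. The sketch below assumes a made-up record layout (CRC32, length, payload); it is not LevelDB's or Riak's on-disk format.

```python
import struct
import zlib

# Assumed layout per record: 4-byte CRC32 of payload, 4-byte payload length, payload.
HEADER = struct.Struct("<II")


def encode_record(payload: bytes) -> bytes:
    return HEADER.pack(zlib.crc32(payload), len(payload)) + payload


def scan_records(blob: bytes):
    """Yield (offset, payload, ok); ok is False when the checksum does not match.

    Note: if a length field itself is damaged, later offsets are unreliable.
    """
    offset = 0
    while offset + HEADER.size <= len(blob):
        crc, length = HEADER.unpack_from(blob, offset)
        start = offset + HEADER.size
        payload = blob[start:start + length]
        ok = len(payload) == length and zlib.crc32(payload) == crc
        yield offset, payload, ok
        offset = start + length


records = [b"alpha", b"bravo", b"charlie"]
blob = bytearray(b"".join(encode_record(r) for r in records))

# Flip one byte inside the second record's payload to simulate disk corruption.
second_payload_start = HEADER.size + len(records[0]) + HEADER.size
blob[second_payload_start] ^= 0xFF

flags = [ok for _, _, ok in scan_records(bytes(blob))]
print(flags)
```

The point of the design is that corruption is localized: the scan still reads the records on either side of the bad one, so only the damaged record needs to be repaired from a replica.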

viraptor 2709 days ago

> It turns out that really unstable hardware/networks like to expose bugs.

This sounds like a weird complaint to be honest. If you verify that your hardware is unstable, how can you expect the software not to fail and corrupt data?

> Also, read the Riak documentation and you'll find the corruption recovery documentation among other hints at common failures and limitations.

I'm not sure how that's a negative thing. You think about and document recovery even if you don't expect things to fail.

nicolast 2709 days ago

> If you verify that your hardware is unstable, how can you expect the software not to fail and corrupt data?

No such thing as 'stable hardware' at scale.

viraptor 2708 days ago

Ok, but then there's also no such thing as no corruption at scale.

If you accept imperfect hardware, you will get errors written to the drive. A single node of a database will get corruption by definition in that case. We're talking about RocksDB specifically here, so it is only one processing node.

How did you expect it to behave instead?

StreamBright 2709 days ago

There is; the error rate just differs between a SAN-backed openstack cluster and AWS, for example. EC2 is reliable compared to the hardware just described above.

manigandham 2709 days ago

What? It seems like you're blaming the db software for a range of external hardware and operations issues. The apps need "so much data at such low latency" yet couldn't have a proper running environment?

zzzcpan 2709 days ago

As far as I'm aware, none of the embedded DBs can help with fast recovery after corruption or any partial recovery. Sucks, but tolerable on local networks and small nodes. For spinning disks or far-away, non-local nodes this of course doesn't work well and you have to implement your own data store.

SamReidHughes 2709 days ago

I think the problem the parent was referring to was the database corrupting itself, or returning corrupt results, not it getting corrupted by an outside agency.

There are certain ways LSM trees can be screwed up in implementation, but the attack surface for corruption is relatively small, so I would not be surprised if that got plugged up. There's room for noobs to put in bugs, but a small enough surface area for a comprehensive cleanup to happen later. So my attitude about RocksDB is that I'm not too worried about LevelDB's history.

But I'm just a RocksDB user who has worked on LSM stores in the past, and I'm giving you my feelings and impressions; I don't have inside knowledge here.

teacpde 2709 days ago

Curious: what is the cause of data corruption in LevelDB?

nickpsecurity 2709 days ago

That kind of concern is probably why FoundationDB built on and modified SQLite. Its reliability is already great.
