by malisper
2107 days ago
So I understand the rationale for writing your own storage layer and think this is an awesome project, but there's something missing for me. One of the issues Peter brings up is that they've come across a number of serious bugs in RocksDB. My question is, why would Pebble have fewer bugs? In fact, I would expect it to have significantly more bugs, because Cockroach is the only company using Pebble.

They mention briefly how they are going about randomized crash testing:

> The random series of operations also includes a “restart” operation. When a “restart” operation is encountered, any data that has been written to the OS but not “synced” is discarded. Achieving this discard behavior was relatively straightforward because all filesystem operations in Pebble are performed through a filesystem interface. We merely had to add a new implementation of this interface which buffered unsynced data and discarded this buffered data when a “restart” occurred.

but this seems to only scratch the surface of what can happen in a crash. For example, it's possible the filesystem had synced some of the buffered data to disk, but not all of it. There's no guarantee about what buffered data was synced to disk. All you know is that some, all, or none of it made it to disk. Bugs in this area are still regularly found in e.g. Postgres, so I'm having a hard time seeing how Cockroach is making sure Pebble doesn't have similar problems.
|
We're only worried about functionality in Pebble used by CockroachDB. RocksDB has a huge number of features that sometimes have bugs due to subtle interactions. There is a very stable subset of RocksDB: the configuration and specific API usage patterns used internally by Facebook. That precise combination has seen extreme testing. But that isn't the subset of RocksDB used by CockroachDB. I would guess that the most significant testing of the subset of RocksDB used by CockroachDB is the testing we do at Cockroach Labs. Now that testing is being directed at Pebble along with the Pebble-specific testing detailed in the post.
> For example, it's possible the filesystem had synced some of the buffered data to disk, but not all of it. There's no guarantee about what buffered data was synced to disk. All you know is that some, all, or none of it made it to disk.
The filesystem does provide guarantees when you use fsync() and fdatasync(). Postgres relies on these guarantees. So does RocksDB. Pebble's usage of fsync/fdatasync mirrors RocksDB's. Our crash testing is not testing the filesystem guarantees, only that we're correctly using fsync/fdatasync (which is hard enough to get right).
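To make the "restart" behavior quoted above concrete, here is a minimal sketch in Go of a file that buffers unsynced writes and drops them on a simulated crash. This is not Pebble's actual filesystem interface or test code: the `crashFile` type and its `Restart` and `Contents` methods are hypothetical names invented for illustration, and the sketch models only the strictest case, where every unsynced byte is lost.

```go
package main

import "fmt"

// crashFile is a hypothetical in-memory file (not a Pebble type) that
// tracks which bytes would survive a crash.
type crashFile struct {
	synced   []byte // data made durable by Sync
	unsynced []byte // data written but not yet synced
}

// Write buffers data; nothing is durable until Sync is called.
func (f *crashFile) Write(p []byte) (int, error) {
	f.unsynced = append(f.unsynced, p...)
	return len(p), nil
}

// Sync models fsync/fdatasync: all buffered data becomes durable.
func (f *crashFile) Sync() error {
	f.synced = append(f.synced, f.unsynced...)
	f.unsynced = f.unsynced[:0]
	return nil
}

// Restart models a crash followed by a restart: unsynced data is discarded.
func (f *crashFile) Restart() {
	f.unsynced = f.unsynced[:0]
}

// Contents returns what a reader would see: synced data, then unsynced data.
func (f *crashFile) Contents() string {
	return string(f.synced) + string(f.unsynced)
}

func main() {
	f := &crashFile{}
	f.Write([]byte("a"))
	f.Sync() // "a" is now durable
	f.Write([]byte("b"))
	fmt.Println(f.Contents()) // prints "ab"
	f.Restart()
	fmt.Println(f.Contents()) // prints "a"
}
```

A test built on something like this checks that the code under test calls Sync everywhere it needs to: any write that should have been durable but was never synced disappears after Restart and shows up as a failure.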