by betawaffle 3672 days ago
Torus is append-only too. We also plan to support something more like what Facebook's paper describes, where they have extra parity (xor) to support more efficient local repair.

How? I thought you're exporting a block device, not a filesystem? You can't append to a block device, and certainly every filesystem out there expects block devices to be random-writable, right?
The "interface" we're exporting is very different from the underlying storage. The block device interface we currently provide supports random writes just fine, but the underlying storage we use (which involves memory-mapped files) is append-only. Once written, blocks are only ever GC'd, not modified.
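To make the distinction concrete, here is a minimal sketch of a random-write block interface layered over append-only storage. It is not Torus's actual code: the class name, the in-memory list standing in for the memory-mapped files, and the 4-byte block size are all assumptions chosen for illustration.

```python
class AppendOnlyBlockStore:
    """Hypothetical sketch: random-write block interface over an append-only log."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.log = []    # append-only: existing entries are never modified in place
        self.index = {}  # logical block number -> log position of the live version

    def write(self, lba, data):
        """Overwriting a block appends a new version and repoints the index."""
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block")
        self.log.append((lba, bytes(data)))
        self.index[lba] = len(self.log) - 1

    def read(self, lba):
        """Return the live version; blocks never written read as zeros."""
        pos = self.index.get(lba)
        if pos is None:
            return bytes(self.block_size)
        return self.log[pos][1]

    def gc(self):
        """Copy live versions into a fresh log, dropping superseded ones.

        Returns the number of stale versions reclaimed.
        """
        live = [
            (lba, data)
            for pos, (lba, data) in enumerate(self.log)
            if self.index.get(lba) == pos
        ]
        reclaimed = len(self.log) - len(live)
        self.log = live
        self.index = {lba: pos for pos, (lba, _) in enumerate(live)}
        return reclaimed


store = AppendOnlyBlockStore(block_size=4)
store.write(7, b"aaaa")
store.write(7, b"bbbb")  # "overwrite": the old version stays in the log until GC
store.write(3, b"cccc")
print(store.read(7), len(store.log), store.gc(), len(store.log))
```

In this sketch, space used by overwritten blocks grows only until the next GC pass, which is the behaviour described above.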
So if I were to run a database on this, with a lot of overwrites, the storage would grow infinitely?

Secondly, this implies you are remapping the LBA (offsets) all the time, perhaps taking what would be sequential access and turning it into random? That sounds pretty painful.

Nope, previous block versions get GC'd. I don't see how LBAs have any relevance here... you're talking about a much lower layer than what Torus is operating on.
You're providing a block device interface to the container. The container's FS is addressing LBAs. Sequential reads to the container's adjacent LBAs get turned into reads to whatever random Torus node is storing the data, based on when it was last written...
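A toy model of that concern, assuming a single log where each write lands at the next position (real placement across Torus nodes may well differ, and the function names are made up for this example): it counts how many adjacent logical reads end up at non-adjacent physical positions.

```python
def log_positions(write_order):
    """Map each LBA to the log position of its most recent write."""
    latest = {}
    for pos, lba in enumerate(write_order):
        latest[lba] = pos  # a later write supersedes the earlier position
    return latest


def discontinuities(latest, lbas):
    """Count adjacent logical reads whose physical positions are not adjacent."""
    positions = [latest[lba] for lba in lbas]
    return sum(1 for a, b in zip(positions, positions[1:]) if b != a + 1)


fresh = log_positions([0, 1, 2, 3])              # written once, in order
churned = log_positions([0, 1, 2, 3, 2, 0])      # LBAs 2 and 0 later overwritten

print(discontinuities(fresh, range(4)))    # 0: the sequential read stays sequential
print(discontinuities(churned, range(4)))  # 3: every step is now a jump
```

Whether this matters in practice depends on the backing media and on how GC relocates live blocks, neither of which this sketch models.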
Exactly what you said. Torus is exposing a block device on top of what could be described as a log-structured FS. So while you may not know about LBAs, there are LBAs involved. I took a look at the code and you are putting an FS like ext4 on top of your block device. Any time an LBA is written to, you append to your store. This causes sequential access to become random, and in addition causes unneeded garbage collection issues.

Furthermore, it appears to me that etcd is now in the "data path". That is, in theory, each access could end up hitting etcd.

If so, I really would question why anyone would do this at all... this is not how any storage system is written.

The problem here is that you are trying to do block on a file system. This is a bigger problem than you can imagine, and while you may think LBAs are not involved, they actually are. You are naively taking on a well-known area in storage.
Ok, so that plus a little MVCC can make you consistent, but you've still got the read-many-to-write-one thing from the perspective of your block device interface, right? And block devices, if I'm remembering right, don't leave you any room to buffer pending writes.
Torus implements a kind of MVCC, yes. As for read-many-to-write-one, I assume you're talking about Reed-Solomon or similar erasure coding? There have been some papers written about ways to reduce that, a good one is from Facebook: https://code.facebook.com/posts/536638663113101/saving-capac.... And that's just one option. Also, this is all speculative since we have yet to implement erasure coding.
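For the simplest case, a single XOR parity block, the read cost of a small write can be sketched as follows. This is a generic illustration of XOR parity, not Torus code (erasure coding is not implemented there) and not the scheme from the Facebook paper; the function names are hypothetical.

```python
from functools import reduce


def xor_blocks(*blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recover one lost data block from the survivors plus the parity block."""
    return xor_blocks(parity, *surviving_blocks)


def update_parity(old_parity, old_data, new_data):
    """Recompute parity for a one-block overwrite.

    Only the old data block and the old parity are read,
    not every block in the stripe.
    """
    return xor_blocks(old_parity, old_data, new_data)


stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(*stripe)

# Lose the middle block and rebuild it from the other two plus parity.
print(rebuild_missing([stripe[0], stripe[2]], parity) == stripe[1])

# Overwrite the middle block and patch the parity with two reads.
new_block = b"\xaa\xbb"
print(update_parity(parity, stripe[1], new_block) == xor_blocks(stripe[0], new_block, stripe[2]))
```

The overwrite still costs two reads and two writes, so it reduces the read-many-to-write-one problem rather than eliminating it.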
Don't know if you'll see this, but:

If only one host at a time has access to a given virtual block device, there are some opportunities to buffer outgoing writes with a write-through cache. That might be the way to go if you explore erasure coding in the future.

Well, good luck with it. Don't get me wrong, I'd love 1.5x redundancy overhead instead of 3x. But even if you have to downgrade to offering either replication or XOR, it's still a huge missing piece of the typical container deployment, so good luck.
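For reference, the 1.5x versus 3x comparison is just raw bytes stored per byte of user data. A quick sketch, where the (6 data, 3 parity) and (2 data, 1 parity) layouts are illustrative assumptions and not anything Torus has announced:

```python
from fractions import Fraction


def storage_overhead(data_blocks, parity_blocks):
    """Raw blocks stored per block of user data."""
    return Fraction(data_blocks + parity_blocks, data_blocks)


# 3-way replication: one data block plus two extra copies.
print(float(storage_overhead(1, 2)))  # 3.0

# Hypothetical erasure-coded stripe: 6 data blocks, 3 parity blocks.
print(float(storage_overhead(6, 3)))  # 1.5

# Single XOR parity over 2 data blocks also costs 1.5x,
# but survives only one lost block per stripe.
print(float(storage_overhead(2, 1)))  # 1.5
```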