| HN Mirror


	by betawaffle 3672 days ago
	Torus is append-only too. We also plan to support something more like what Facebook's paper describes, where they have extra parity (xor) to support more efficient local repair.
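
As a rough illustration of the XOR parity idea mentioned above (not Torus code, and not the scheme from the Facebook paper), here is a minimal sketch of repairing one lost block from a small local group. The group size, block contents, and the helper name `xor_blocks` are all made up for the example.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical local group: three data blocks plus one XOR parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Pretend data[1] is lost. XOR-ing the survivors with the parity block
# cancels out everything except the missing block.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```

The repair only reads the other members of the local group, which is why a small group with its own parity is cheaper to repair than a wide stripe.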

jbooth 3672 days ago

How? I thought you're exporting a block device, not a filesystem? You can't append to a block device, and certainly every filesystem out there expects block devices to be random-writable, right?

betawaffle 3672 days ago

The "interface" we're exporting is very different from the underlying storage. The block device interface we currently provide supports random writes just fine, but the underlying storage we use (which involves memory-mapped files) is append-only. Once written, blocks are only ever GC'd, not modified.
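
To make the distinction between interface and storage concrete, here is a toy sketch, assuming an in-memory log and a simple copy-forward GC. It is not how Torus is implemented; the class `AppendOnlyBlockStore` and its methods are hypothetical names chosen for the example.

```python
class AppendOnlyBlockStore:
    """Toy model: a random-write block interface over an append-only log."""

    def __init__(self):
        self.log = []    # append-only list of (lba, data) records
        self.index = {}  # lba -> position of the live record in the log

    def write(self, lba, data):
        # Never overwrite in place: append a new version and repoint the index.
        self.log.append((lba, data))
        self.index[lba] = len(self.log) - 1

    def read(self, lba):
        pos = self.index.get(lba)
        return None if pos is None else self.log[pos][1]

    def gc(self):
        """Copy live records into a fresh log, dropping superseded versions."""
        live = sorted(self.index.items(), key=lambda item: item[1])
        self.log = [self.log[pos] for _, pos in live]
        self.index = {lba: i for i, (lba, _) in enumerate(self.log)}


store = AppendOnlyBlockStore()
for i in range(1000):
    store.write(7, b"version %d" % i)  # overwrite the same block repeatedly

print(len(store.log))  # 1000 records before GC
store.gc()
print(len(store.log))  # 1 record after GC
print(store.read(7))   # b'version 999'
```

In this model, space grows with overwrites only until the next GC pass, at which point it shrinks back to the size of the live data.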

stevelandiss 3672 days ago

So if I were to run a database on this, with a lot of overwrites, the storage would grow infinitely?

Secondly, this implies you are remapping the LBA (offsets) all the time, perhaps taking what would be sequential access and turning it into random? That sounds pretty painful.

betawaffle 3672 days ago

Nope, previous block versions get GC'd. I don't see how LBAs have any relevance here... you're talking about a much lower layer than what Torus is operating on.

gregsfortytwo 3672 days ago

You're providing a block device interface to the container. The container's FS is addressing LBAs. Sequential reads to the container's adjacent LBAs get turned into reads to whatever random Torus node is storing the data, based on when it was last written...

curtisptrsn 3672 days ago

Exactly what you said. Torus is exposing a block device on top of what could be described as a log-structured FS. So while you may not know about LBAs, there are LBAs involved. I took a look at the code, and you are putting an FS like ext4 on top of your block device. Any time an LBA is written to, you append to your store. This causes sequential access to become random and, in addition, causes unneeded garbage collection issues.
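
The sequential-to-random effect can be shown with a small sketch. It assumes a single append-only log where the latest write to an LBA wins; it says nothing about how Torus actually places blocks across nodes, and `physical_layout` is a hypothetical helper.

```python
def physical_layout(writes):
    """Map each LBA to the log position of its most recent write."""
    index = {}
    for pos, lba in enumerate(writes):
        index[lba] = pos  # latest write wins
    return index

# Sequential fill of LBAs 0..7, followed by overwrites of LBAs 2, 5, and 3.
writes = list(range(8)) + [2, 5, 3]
layout = physical_layout(writes)

# A sequential read of LBAs 0..7 now touches these log positions, in order:
positions = [layout[lba] for lba in range(8)]
print(positions)  # [0, 1, 8, 10, 4, 9, 6, 7]
```

After only three overwrites, a read that is sequential in LBA order jumps back and forth in the log.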

Furthermore, it appears to me that etcd is now in the "data path". That is, in theory, each access could end up hitting etcd.

If so, I really would question why anyone would do this at all... this is not how any storage system is written.

ntspusr1 3671 days ago

The problem here is that you are trying to do block on a file system. This is a bigger problem than you can imagine, and while you may think LBAs are not involved, they actually are. You are naively taking on a well-known area in storage.

jbooth 3672 days ago

Ok, so that plus a little MVCC can make you consistent, but you've still got the read-many-to-write-one thing from the perspective of your block device interface, right? And block devices, if I'm remembering right, don't leave you any room to buffer pending writes.

betawaffle 3672 days ago

Torus implements a kind of MVCC, yes. As for read-many-to-write-one, I assume you're talking about Reed-Solomon or similar erasure coding? There have been some papers written about ways to reduce that; a good one is from Facebook: https://code.facebook.com/posts/536638663113101/saving-capac.... And that's just one option. Also, this is all speculative since we have yet to implement erasure coding.
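
For readers unfamiliar with the read-many-to-write-one cost, here is a sketch using plain XOR parity as a simpler stand-in for Reed-Solomon. The stripe contents and the helper names `xor_bytes` and `update_parity` are invented for illustration; nothing here reflects an actual Torus design.

```python
def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def update_parity(old_parity, old_data, new_data):
    """Recompute parity after one block changes, without reading the whole stripe."""
    return xor_bytes(xor_bytes(old_parity, old_data), new_data)

# Hypothetical stripe of three data blocks and their XOR parity.
stripe = [b"\x01\x01", b"\x02\x02", b"\x04\x04"]
parity = b"\x07\x07"  # 0x01 ^ 0x02 ^ 0x04

# Overwriting one block needs two reads (old data, old parity)
# and two writes (new data, new parity).
new_block = b"\x08\x08"
new_parity = update_parity(parity, stripe[1], new_block)
stripe[1] = new_block

# The incrementally updated parity matches a full recomputation.
full = xor_bytes(xor_bytes(stripe[0], stripe[1]), stripe[2])
assert new_parity == full
```

Even in this simplest case, a single-block write turns into multiple reads and writes, which is the amplification being discussed.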

jbooth 3669 days ago

Don't know if you'll see this, but:

If only one host at a time has access to a given virtual block device, there are some opportunities to buffer outgoing writes with a write-through cache. That might be the way to go if you explore erasure coding in the future.

jbooth 3671 days ago

Well, good luck with it. Don't get me wrong, I'd love 1.5x redundancy overhead instead of 3x. But even if you have to downgrade to offering either replication or XOR, it's still a huge missing piece of the typical container deployment, so good luck.
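
The 1.5x versus 3x figures follow from simple arithmetic: with k data blocks and m redundant blocks, raw storage is (k + m) / k times the logical data. The sketch below uses illustrative parameters (4 data + 2 parity) that are not taken from Torus or from the Facebook post.

```python
def storage_overhead(data_blocks, redundant_blocks):
    """Raw bytes stored per logical byte for a k + m redundancy scheme."""
    return (data_blocks + redundant_blocks) / data_blocks

# Triple replication: one data block plus two extra copies.
print(storage_overhead(1, 2))  # 3.0

# Hypothetical erasure code with 4 data blocks and 2 parity blocks.
print(storage_overhead(4, 2))  # 1.5

# Single XOR parity over 2 data blocks lands at the same overhead,
# but tolerates only one lost block instead of two.
print(storage_overhead(2, 1))  # 1.5
```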
