by josephg 519 days ago
It doesn’t have to be one or the other. Developers could decide by passing flags to open.

But even then, doing atomic writes of multi-gigabyte files doesn’t sound that hard to implement efficiently. Just write the new data to disk first and update the metadata atomically at the end, or whenever you choose to as a programmer.
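
For reference, the closest thing available today is the temp-file-plus-rename trick. A minimal sketch, assuming a POSIX system and Python; `atomic_write` is a made-up helper name, not an existing API, and it is not the hypothetical `open` flag discussed here:

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace the contents of `path` with `data`, all or nothing."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be on the same filesystem as the target,
    # otherwise the final rename is not atomic.
    # Note: mkstemp creates the file with mode 0600.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # get the new data on disk first
        os.replace(tmp_path, path)  # then switch the name over in one step
    except BaseException:
        # On failure the old file is untouched; just drop the temp file.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # fsync the directory so the rename itself survives a crash
    # (assumption: POSIX; opening a directory like this fails on Windows).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Until the rename happens, both the old and the new version exist on disk, which is where the extra free space requirement comes from.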

The downside is that, when overwriting, you’ll need enough free space to store both the old and new versions of your data. But I think that’s usually a good trade-off.

It would allow all sorts of useful programs to be written easily, like an atomic mode for apt, where packages are either installed or not installed, but never half installed.


Packages consist of multiple files. An atomic file write would not allow packages to be either installed or not installed by APT.
Atomicity could encompass a whole bunch of writes at once.
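
One way to get "a whole bunch of writes at once" with today's APIs is to stage every file in a fresh directory and then flip a single symlink. This is only a sketch, assuming POSIX symlinks; `install_release` and the `releases/` and `current` layout are invented for illustration and have nothing to do with how APT actually works:

```python
import os
import tempfile
from typing import Mapping


def install_release(root: str, name: str, files: Mapping[str, bytes]) -> None:
    """Stage `files` under root/releases/name, then point root/current at it."""
    release_dir = os.path.join(root, "releases", name)
    os.makedirs(release_dir)  # raises if this release already exists
    for relative_path, data in files.items():
        target = os.path.join(release_dir, relative_path)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())

    # Create the new symlink under a temporary name, then rename it over
    # the old one. The rename swaps the link in a single step, so readers
    # see either the old release or the new one, never a mix.
    current = os.path.join(root, "current")
    tmp_link = os.path.join(root, ".current.tmp")
    if os.path.lexists(tmp_link):
        os.unlink(tmp_link)
    os.symlink(os.path.join("releases", name), tmp_link)
    os.replace(tmp_link, current)


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    install_release(root, "v1", {"bin/tool": b"one"})
    install_release(root, "v2", {"bin/tool": b"two"})
    print(os.readlink(os.path.join(root, "current")))
```

If the process dies while staging, `current` still points at the previous release and the half-written directory can simply be deleted.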

Databases implemented atomic transactions in the 70s. Let’s stop pretending like this is an unsolvable CS problem. It’s not.

That is not what an atomic write() function does and we are talking about APT, not databases.

If you want atomic updates with APT, you could look into doing prestaged updates on ZFS. It should be possible to retrofit it into APT. Have it update a clone of the filesystem and create a new boot environment after it is done. The boot environment either is created or is not created. Then reboot into the updated OS and you can promote the clone and delete the old boot environment afterward. OpenSolaris had this capability over a decade ago.

> Databases implemented atomic transactions in the 70s.

And they have deadlocks as a result, for which there is no good, easy solution (generally we work around them by having only one program access a given database at a time, and even that is not 100% reliable).

Eh. Deadlocks can be avoided if you don’t use SQL’s exact semantics. For example, FoundationDB uses MVCC such that if two conflicting write transactions are committed at the same time, one transaction succeeds and the other is told to retry.

It works great in practice, even with a lot of concurrent clients. (iCloud is all built on FoundationDB).
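
To make the "one succeeds, the other retries" idea concrete, here is a toy in-memory sketch of optimistic concurrency control in Python. It is not FoundationDB's API or implementation, and it is simpler than real MVCC (there are no snapshot reads); `VersionedStore`, `Transaction`, `ConflictError` and `run_transaction` are all made-up names:

```python
import threading


class ConflictError(Exception):
    """Raised when another transaction committed a key we read."""


class VersionedStore:
    def __init__(self):
        self._lock = threading.Lock()  # held only briefly, never while waiting
        self._data = {}
        self._versions = {}

    def begin(self):
        return Transaction(self)


class Transaction:
    def __init__(self, store):
        self._store = store
        self._reads = {}   # key -> version seen on first read
        self._writes = {}  # buffered until commit

    def get(self, key, default=None):
        if key in self._writes:
            return self._writes[key]
        with self._store._lock:
            self._reads.setdefault(key, self._store._versions.get(key, 0))
            return self._store._data.get(key, default)

    def set(self, key, value):
        self._writes[key] = value

    def commit(self):
        store = self._store
        with store._lock:
            # Validate: nothing we read may have changed since we read it.
            for key, seen in self._reads.items():
                if store._versions.get(key, 0) != seen:
                    raise ConflictError(key)
            # Apply all buffered writes in one step.
            for key, value in self._writes.items():
                store._data[key] = value
                store._versions[key] = store._versions.get(key, 0) + 1


def run_transaction(store, body, max_retries=10):
    """Run body(txn) and commit, retrying from scratch on conflict."""
    for _ in range(max_retries):
        txn = store.begin()
        result = body(txn)
        try:
            txn.commit()
            return result
        except ConflictError:
            continue
    raise RuntimeError("too many conflicts")
```

No transaction ever holds a lock while waiting for another one, so there is nothing to deadlock on; the cost is that a losing transaction has to redo its work.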

Holding locks while waiting for other locks (hold and wait) is what causes deadlocks. I agree with you - that would be a bad way to implement filesystem transactions. But we have a lot of other options.

This is kind of an interesting thought that more mirrors how Docker uses OverlayFS to track changes to the entire file system. No need for new file APIs.
It can also use ZFS to do this.
> Developers could decide by passing flags to open.

Provided the underlying VFS has implemented them. They may not. Hence the point in the article that some developers only choose to support 'ext4' and nothing else.

> you’ll need enough free space to store both the old and new versions of your data.

The sacrifice is increased write wear on solid state devices.

> It would allow all sorts of useful programs to be written easily

Sure. As long as you don't need multiple processes to access the same file simultaneously. I think the article misses this point, too: every FS on a multi-user system is effectively a "distributed system." It's not distributed for _redundancy_, but that doesn't eliminate the attendant challenges.

Dropbox reversed its stance on this. It added support for ZFS, XFS, ecryptfs and btrfs:

https://help.dropbox.com/installs/system-requirements

They say ecryptfs is only supported when it is backed by ext4, which is a bit strange. I wonder if that is documented just so they can close support cases when ecryptfs is used on top of a filesystem that lacks extended attribute support, while their code does not actually check what is below ecryptfs. Usually the application above would not know what is below ecryptfs, so they would need to go out of their way to check this in order to enforce it. I do not use Dropbox, so someone curious enough would need to test whether they actually enforce that.

Yes, a feature like this would need cooperation with the filesystem. But that’s just an implementation problem. That’s like saying we can’t add flexbox to browsers because all the browsers would need to add it. So?

As for wear on SSDs, I don’t think it would increase wear. You’re writing the same number of sectors on the drive. A 2 GB write would still write 2 GB (+ negligible metadata overhead). Why would the drive wear out faster in this scheme?

And I think it would work way better with multiple processes than the existing system. Right now the semantics when multiple processes edit the same file at once are somewhat undefined. With this approach, files would have database-like semantics where any reader would either see the state before a write or the state after. It’s much cleaner, since it would become impossible for skewed reads or writes to corrupt a shared file.
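
You can already approximate the "before or after, never in between" behaviour with rename on POSIX systems. A small sketch, assuming Linux-like semantics where an open file descriptor keeps pointing at the old inode after the name is replaced; `demo_snapshot_read` is a made-up name:

```python
import os
import tempfile


def demo_snapshot_read():
    """Return (what an in-flight reader saw, what a fresh reader sees)."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "shared.txt")
        with open(path, "wb") as f:
            f.write(b"old " * 1000)

        # Reader starts before the update (unbuffered, so nothing is
        # read ahead of time) and is halfway through the file.
        reader = open(path, "rb", buffering=0)
        first_half = reader.read(2000)

        # Writer publishes a complete new version under the same name.
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(b"new " * 1000)
        os.replace(tmp, path)

        # The in-flight reader keeps reading the old version.
        second_half = reader.read()
        reader.close()

        # A reader that opens the file now sees only the new version.
        with open(path, "rb") as f:
            fresh = f.read()
        return first_half + second_half, fresh
```

Neither reader ever sees a mix of old and new bytes, which is the property a transactional file API would give you without the temp-file dance.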

Would you argue against the existence of database transactions? Of course not. Nobody does. They’re a great idea, and they’re way easier to reason about and use correctly compared to the POSIX filesystem API. I’m saying we should have the same integrity guarantees on the filesystem. I think if we had those guarantees already, you’d agree too.