Hacker News new | ask | show | jobs
by Retr0id 519 days ago
> why the file API so hard to use that even experts make mistakes?

I think the short answer is that the APIs are bad. The POSIX fs APIs and associated semantics are so deeply entrenched in the software ecosystem (both at the OS level, and at the application level) that it's hard to move away from them.

4 comments

I take a different view on this. IMO the tricks that existing file systems play to get more performance (specifically around ordering and atomicity) make it extra hard for developers to reason about. Obviously, you can't do anything about fsync dropping error codes, but some of these failure modes just aren't possible over file systems like NFS due to protocol semantics.
Not only that, but the POSIX file API also assumes that NFS is a thing but NFS breaks half the important guarantees of a file system. I don’t know if it’s a baby and bath water situation, but NFS just seems like a whole bunch of problems. It’s like having eval in a programming language.
The whole software ecosystem is built on bubblegum, tape, and prayers.
What aspects of NFS do you think break half of the important guarantees of a file system?
Well, at least O_APPEND, O_EXCL, O_SYNC, and flock() aren't guaranteed to work (although they can with recent versions as I understand it).

UID mapping causing read() to return -EACCES after open() succeeds breaks a lot of userland code.

Lack of inotify support is one that has annoyed me in the past. It not only breaks some desktop software, but it also should be possible for NFS to support (after all, the server sees the changes and could notify clients).
Thanks for this, it's helpful. Totally heard about O_APPEND and read() returning -EACCESS. The other ones, I agree, should be fixed in later versions of the Linux kernel/NFS client.
Just ran into this one recently trying to replace Docker w/ Podman for a CICD runner. Before anyone protests we have very strong, abnormal requirements on my project preventing most saner architectures. It wasn’t the root cause but the failure behavior was weird due to the behavior you just described.
POSIX is also so old and essential that it's hard to imagine an alternative.
Not really, there's been lots of APIs that have improved on the POSIX model.

The kind of model I prefer is something based on atomicity. Most applications can get by with file-level atomicity--make whole file read/writes atomic with a copy-on-write model, and you can eliminate whole classes of filesystem bugs pretty quickly. (Note that something like writeFileAtomic is already a common primitive in many high-level filesystem APIs, and it's something that's already easily buildable with regular POSIX APIs). For cases like logging, you can extend the model slightly with atomic appends, where the only kind of write allowed is to atomically append a chunk of data to the file (so readers can only possibly either see no new data or the entire chunk of data at once).

I'm less knowledgeable about the way DBs interact with the filesystem, but there the solution is probably ditching the concept of the file stream entirely and just treating files as a sparse map of offsets to blocks, which can be atomically updated. (My understanding is that DBs basically do this already, except that "atomically updated" is difficult with the current APIs).

> Most applications can get by with file-level atomicity--make whole file read/writes atomic with a copy-on-write model, and you can eliminate whole classes of filesystem bugs pretty quickly.

    int fd = open(".config", O_RDWR | O_CREAT | O_SYNC_ON_CLOSE, 0o666);

    // effects of calls to write(2)/etc. are invisible through any other file description
    // until the close(2) is called on all descriptors to this file description.

    close(fd);
So now you can watch for e.g. either IN_MODIFY or IN_CLOSE_WRITE (and you don't need to balance it with IN_OPEN), it doesn't matter, you'll never see partial updates... would be nice!
Surely this can’t always be true?

What happens when a lot of data is written and exceeds the dirty threshold?

It gets written on the disk but into different inodes, I imagine.
It's not hard to design a less bug-prone API that would enable you to do everything the POSIX file API permits and admits equally-high-performance implementations. But making that new API a replacement for the POSIX API would require rewriting essentially all of the software that somebody cares about to use your new, better API instead of the POSIX API. This is probably only feasible in practice for small embedded systems with a fairly small universe of software.
You could do a phased transition, where both the legacy posix api and the new api are available. This has already happened with a lot of the old C standard library. Old, unsafe functions like strcpy were gradually replaced by safer alternatives like strncpy.

Database developers don’t want the complexity or poor performance of posix. It’s wild to me that we still don’t have any alternative to fsync in Linux that can act as a barrier without also flushing caches at the same time.

There are two serious factual errors in your comment:

- This has not already happened with a lot of the old C standard library. The only function that has ever been removed from the C standard library, to my knowledge, is gets(). In particular, strcpy() has not been removed. Current popular compilers still support gets() with the right options, so it hasn't been removed from the actual library, just the standard.

- strncpy() is not a suitable replacement for strcpy(), certainly not a safer one. It can produce strings missing the terminating null, and it can be slower by orders of magnitude. This has been true since it was introduced in the 01970s. Nearly every call to strncpy() is a bug, and in many cases an exploitable security hole. You are propagating dangerous misinformation. (This is a sign of how difficult it is to make these transitions.)

You also seem to imply that Linux cannot add system calls that are not specified in POSIX, but of course it can and does; openat() and the other 12 related functions, epoll_*(), io_uring_*(), futex_*(), kexec_load(), add_key(), and many others are Linux-specific. The reason barrier() hasn't been added is evidently that the kernel developers haven't been convinced it's worthwhile in the 15+ years since it was proposed, not that POSIX ties their hands.

The nearest equivalents in C for the kind of "staged transition" you are proposing might be things like the 16-bit near/far/huge qualifiers and the Win16 and pre-X MacOS programming models. In each of these cases, a large body of pre-existing software was essentially abandoned and replaced by newly written software.

Yeah, I understand that those methods are still available. But their use is heavily discouraged in new software and a lot of validators & sanitisers will warn if your programs use them. Software itself has largely slowly moved to using the newer, safer methods even though the old methods were never taken away.

I don’t understand the reticence of kernel developers to implement a barrier syscall. I know they could do it. And as this article points out, it would dramatically improve database performance for databases which make use of it. Why hasn’t it happened?

Another commenter says NVMe doesn’t support it natively but I bet hardware vendors would add hardware support if Linux supported it and adding barrier support to their hardware would measurably improve the performance of their devices.

NVMe has no barrier that doesn't flush the pipeline/ringbuffer of IO requests submitted to it :(
The kernel could implement a non-flushing barrier, even if the underlying device doesn't. You could even do it without any barrier support at all from the underlying device, as long as it reliably tells you when each request has completed; you just don't send it any requests from after the barrier until all the requests before the barrier have completed.
Writes in the POSIX API can be atomic depending on the underlying filesystem. For example, small writes on ZFS through the POSIX API are atomic since they either happen in their entirety or they do not (during power failure), although if the writes are big enough (spanning many records), they are split into separate transactions and partial writes are then possible:

https://github.com/openzfs/zfs/blob/34205715e1544d343f9a6414...

Writes on ZFS cease to be atomic around approximately 32MB in size if I read the code correctly.

> make whole file read/writes atomic with a copy-on-write model,

I have many files that are several GB. Are you sure this is a good idea? What if my application only requires best effort?

> eliminate whole classes of filesystem bugs pretty quickly.

Block level deduplication is notoriously difficult.

> where the only kind of write allowed is to atomically append a chunk of data to the file

Which sounds good until you think about the complications involved in block oriented storage medium. You're stuck with RMW whether you think you're strictly appending or not.

It doesn’t have to be one or the other. Developers could decide by passing flags to open.

But even then, doing atomic writes of multi gigabyte files doesn’t sound that hard to implement efficiently. Just write to disk first and update the metadata atomically at the end. Or whenever you choose to as a programmer.

The downside is that, when overwriting, you’ll need enough free space to store both the old and new versions of your data. But I think that’s usually a good trade off.

It would allow all sorts of useful programs to be written easily - like an atomic mode for apt, where packages either get installed or not installed. But they can’t be half installed.

Packages consist of multiple files. An atomic file write would not allow packages to be either installed or not installed by APT.
Atomicity could encompass a whole bunch of writes at once.

Databases implemented atomic transactions in the 70s. Let’s stop pretending like this is an unsolvable CS problem. Its not.

This is kind of an interesting thought that more mirrors how Docker uses OverlayFS to track changes to the entire file system. No need for new file APIs.
> Developers could decide by passing flags to open.

Provided the underlying VFS has implemented them. They may not. Hence the point in the article that some developers only choose to support 'ext4' and nothing else.

> you’ll need enough free space to store both the old and new versions of your data.

The sacrifice is increased write wear on solid state devices.

> It would allow all sorts of useful programs to be written easily

Sure. As long as you don't need multiple processes to access the same file simultaneously. I think the article misses this point, too, in that, every FS on a multi user system is effectively a "distributed system." It's not distributed for _redundancy_ but it doesn't eliminate the attendant challenges.

Dropbox reversed its stance on this. It added support for ZFS, XFS, ecryptfs and btrfs:

https://help.dropbox.com/installs/system-requirements

They say ecryptfs is only supported when it is backed by ext4, which is a bit strange. I wonder if that is documented just to be able to close support cases when ecryptfs is used on top of a filesystem that is missing extended attribute support and their actual code does not actually check what is below ecryptfs. Usually the application above would not know what is below ecryptfs, so they would need to go out of their way to check this in order to enforce that. I do not use Dropbox, so someone else would need to test to see if they actually enforce that if curious enough.

Yes, a feature like this would need cooperation with the filesystem. But that’s just an implementation problem. That’s like saying we can’t add flexbox to browsers because all the browsers would need to add it. So?

As for wear on SSDs, I don’t think it would increase wear. You’re writing the same number of sectors on the drive. A 2gb write would still write 2gb (+ negligible metadata overhead). Why would the drive wear out faster in this scheme?

And I think it would work way better with multiple processes than the existing system. Right now the semantics when multiple processes edit the same file at once are somewhat undefined. With this approach, files would have database like semantics where any reader would either see the state before a write or the state after. It’s much cleaner - since it would become impossible for skewed reads or writes to corrupt a shared file.

Would you argue against the existence of database transactions? Of course not. Nobody does. They’re a great idea, and they’re way easier to reason about and use correctly compared to the POSIX filesystem api. I’m saying we should have the same integrity guarantees on the filesystem. I think if we had those guarantees already, you’d agree too.

Some of the problems transcend POSIX. Someone I know maintains a non-relational db on IBM mainframes. When diving into a data issue, he was gob-smacked to find out that sync'd writes did not necessarily make it to the disk. They were cached in the drive memory and (I think) the disk controller memory. If all failed, data was lost.
This is precisely why well-designed enterprise-grade storage systems disable the drive cache and rely upon some variant of striping to achieve good I/O performance.
Just wait till he has to deal with raid controllers.
I use Plan 9 regularly and while its Unix heritage is there, it most certainly isn't Unix and completely does away with POSIX.
> POSIX fs APIs and associated semantics

Well I think that's the actual problem. POSIX gives you an abstract interface but it essentially does not enforce any particular semantics on those interfaces.