by fmap 1116 days ago
I don't think many people are arguing against having arbitrary byte arrays for storage and using application specific serialization formats. The real problem with file systems, imo, is that they present a leaky abstraction over something that's internally very complex. Any single operation might look simple, but as soon as you start combining operations you're going to have a bad time with edge cases.

For example, let's say you need to ensure that your writes actually end up on disk: https://stackoverflow.com/questions/12990180/what-does-it-ta...
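To give a rough idea of how much ceremony this takes, here is a minimal sketch of the usual "write to a temp file, fsync, rename, fsync the directory" pattern. It assumes a POSIX system, and `durable_write` is a name made up for illustration. It is not a complete answer: it ignores things like disks that lie about flushing.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so the result survives a crash (POSIX sketch)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Temp file lives in the same directory so the rename stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()             # userspace buffer -> kernel
            os.fsync(f.fileno())  # kernel page cache -> storage device
        os.replace(tmp_path, path)  # atomic swap on POSIX file systems
    except BaseException:
        # Best-effort cleanup of the temp file if anything went wrong.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # The rename is a directory update, so the directory needs its own fsync.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Four steps and two fsyncs for what looks like "save a file", and leaving any of them out tends to work fine until the power goes out.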

The typical file abstraction introduces buffering on top of this, which adds additional edge cases: https://stackoverflow.com/questions/42434872/writing-program...
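A small sketch of that extra layer, using Python's buffered file objects as a stand-in for any language's file abstraction (the function name is hypothetical). For a payload smaller than the buffer, the write stays in the process until `flush()`, and even then it has only reached the kernel, not the device.

```python
import os


def sizes_around_flush(path: str, payload: bytes = b"hello") -> tuple[int, int]:
    """Return the file size seen by the OS before and after flushing a buffered write."""
    with open(path, "wb", buffering=4096) as f:
        f.write(payload)                # sits in the userspace buffer
        before = os.path.getsize(path)  # the kernel has not seen the bytes yet
        f.flush()                       # hand the bytes to the kernel (page cache)
        after = os.path.getsize(path)
        os.fsync(f.fileno())            # separately ask for them to reach the device
    return before, after
```

So a crash of the process loses whatever was in the buffer, and a crash of the machine can lose whatever was flushed but never fsynced.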

If you want ordered writes, then you have to handle this through some kind of journalling or at the application level with something like this: https://pages.cs.wisc.edu/~vijayc/papers/UWCS-TR-2012-1709.p...

And that's only something that's visible at the application level. There are plenty of similar edge cases below the surface.

Even if all of this works correctly, you still have to remember that very few systems give any guarantees about what data ends up on disk. So you often end up using checksums and/or error correcting codes in your files.
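As a rough illustration of the checksum part, here is one way to frame a record with its length and a CRC32. The format (little-endian length, then CRC, then payload) is something I made up for the example, not any standard, and CRC32 only detects corruption; it does not correct it.

```python
import struct
import zlib

# Hypothetical record header: payload length and CRC32 of the payload.
HEADER = struct.Struct("<II")


def encode_record(payload: bytes) -> bytes:
    """Prefix the payload with its length and checksum."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def decode_record(blob: bytes) -> bytes:
    """Return the payload, or raise ValueError if the record is torn or corrupt."""
    if len(blob) < HEADER.size:
        raise ValueError("truncated header")
    length, expected_crc = HEADER.unpack_from(blob)
    payload = blob[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated payload")  # e.g. a partially written record
    if zlib.crc32(payload) != expected_crc:
        raise ValueError("checksum mismatch")  # bytes changed after writing
    return payload
```

The reader then has to decide what to do with a bad record, which is its own edge case.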

Finally, all this is really only talking about single files and write operations. As soon as you need to coordinate operations between multiple files (e.g., because you're using directories) things become more complicated. If you then want to abuse the file system to do something other than read and write arrays of bytes you will have to deal with operations that are even more broken, e.g., file locking: http://0pointer.de/blog/projects/locking.html
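For reference, the "simple" version of advisory locking looks something like the sketch below, assuming a POSIX system with `flock` (the helper name is hypothetical). It is only advisory, so it does nothing against processes that don't take the lock, and behavior on network file systems varies; those are the kinds of problems the linked post goes into.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def exclusive_lock(path: str):
    """Hold an advisory exclusive lock on `path`; raise BlockingIOError if already held."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # LOCK_NB makes this fail immediately instead of waiting for the holder.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        try:
            yield fd
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

Usage is `with exclusive_lock("/some/lockfile"): ...`, where the path is whatever lock file the cooperating processes have agreed on.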

---

It's not an accident that you are using other services to store your files. For example, S3 handles a lot of the complexity around durable storage for you, but at a massive cost in latency compared to what the underlying hardware is capable of.

Similarly, application programmers often end up using embedded databases, for exactly the same reason and with exactly the same problem.

This is a shame, because your file system has to solve many of the same problems internally anyway. Metadata is usually guaranteed to be consistent and this is implemented through some sort of journaling system. It's just that the file system abstraction does not expose any of this and necessitates a lot of duplicate complexity at the application level.
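To make the embedded database route concrete, here is a minimal sketch using SQLite from Python. The table layout and function names are made up for the example. The point is that updating several "files" atomically is one transaction here, whereas on a plain file system you would have to build that yourself.

```python
import sqlite3


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) a tiny name -> bytes store backed by SQLite."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS blobs (name TEXT PRIMARY KEY, data BLOB NOT NULL)"
    )
    conn.commit()
    return conn


def put_many(conn: sqlite3.Connection, items: dict) -> None:
    """Write all items or none of them."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.executemany(
            "INSERT OR REPLACE INTO blobs (name, data) VALUES (?, ?)",
            list(items.items()),
        )


def get(conn: sqlite3.Connection, name: str):
    """Return the stored bytes for `name`, or None if it does not exist."""
    row = conn.execute("SELECT data FROM blobs WHERE name = ?", (name,)).fetchone()
    return None if row is None else row[0]
```

If any row in a `put_many` call fails, none of that call's rows are visible afterwards, which is exactly the guarantee the file system keeps to itself.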

---

Edit: After re-reading the grandparent comment, it sounds like they are arguing against the "array of bytes" model. I agree that this is usually not what you want at the application level, but it's less clear how to build a different abstraction that can be introduced incrementally. Without incremental adoption such a solution just won't work.

1 comments

Those are all good points. I will read the rest of the links!

My question is whether those uncertainties can be fixed with a slower but ordered and safe file system for typical application use, leaving a bleeding-edge one, with plenty of sharp edge cases that application programmers handle at the app level, for high-performance compute work. Hardware is so fast and RAM so inexpensive that I think adding 30% to file write IO time would not greatly impact the user experience compared with all the other causes of lag that burden us, like network and bloat.

Then in the HPC world, if a new byte cloud where all context lives in some database with a magic index naturally comes to be, we can move to that. I won't rule out needing to change the underlying file system, because that's pretty far over my head and there are good ideas I don't understand.

My point is to push against the proprietary-format, vendor lock-in file system abstractions, like the nested objects I get in Microsoft PowerPoint or Word or Apple GarageBand, where the app merely wraps files and hides the actual data that you could otherwise pick up and move to another app. I don't want to have to adopt a new way of thinking about pretty simple objects for every different program.

I like WAV > FLAC, plain text > binary, constant bit rate > variable bit rate, SQLite > cloud company DB (not really fair, but just saying the SQLite one-file DB is amazing). Storage is inexpensive, adding layers to decode the data runs a risk of breaking it, and I like interoperability. Once you lose the file data and just have content clouds, there might be compression running on the data and changing the quality, e.g. YouTube as a video store, with successive compression algorithms aging old videos.

It drives me nuts when I need to attach something and I'm faced with a huge context list where I'd rather go find a directory. Abstractions are just that: mental models to avoid the low-level stuff. I'm still cool thinking of my information as file trees; I think that's an OK level. But you're right, complex operations with a file system have issues. I've messed up logging and file IO by not thinking it through, and it made me think about needing to fix my mistaken code.