Hacker News
by ravenstine 2161 days ago
Is there any advantage in writing a custom file system for a niche purpose? It seems like most file systems are just different variations of managing where/when files are written simultaneously. Could a file system written specifically for something like PostgreSQL cut out the middle-man and increase performance?
3 comments

Yes. Oracle has done this (ASM) to eliminate overhead, implement fault tolerance and provide a storage management interface based on SQL, for example.

I once made a 'file system' to mount cpio archives (read-only) in an embedded system. Cpio is an extremely simple format to generate and edit (in code) and mounting it directly was very effective.

I suspect operating on block storage directly may be both easier and more reliable for databases, since about 75% of the complication in writing transactional I/O software is working around the kernel's behavior.
The kernel's fsync behavior is one thing, but simply relying on a massive amount of fragile C code running in the kernel is a significant liability, especially if your software is a centralized database, where a crash or panic will bring down everything.
Yes, and also the traditional answer was that the kernel handles weird and complicated hardware and can talk to RAID controllers properly, but nowadays hardware has much less variance, and RAID is rare (and arguably unnecessary for a direct-IO database).
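
To make the "working around the kernel's behavior" point concrete, here is a minimal sketch of the usual write-then-rename dance an application goes through to get a durable, atomic file update on a POSIX system. The function name `durable_write` is hypothetical, and this is an illustration of the general pattern rather than what any particular database does; real engines also have to decide how to react when `fsync` itself reports an error.

```python
import os
import tempfile


def durable_write(path, data):
    """Atomically replace `path` with `data` and flush it to stable storage.

    Hypothetical helper for illustration; assumes a POSIX filesystem.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory so the rename
    # below stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # drain Python's userspace buffer
            os.fsync(tmp.fileno())  # ask the kernel to flush the file data
        os.replace(tmp_path, path)  # atomic rename on POSIX
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # The rename lives in the directory entry, so the directory needs
    # its own fsync for the new name to survive a crash.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Four separate steps (flush, fsync the file, rename, fsync the directory) for one logical "save this" is a small taste of the complication being described.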

I think it'd be viable for an enterprise-y database to do IO directly over NVMe. Imagine the efficiency and throughput gains you could get from a database that (1) has a unified view of memory allocation in the system and (2) directly performs its page-level IO on the storage devices.
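
As a rough sketch of what page-level IO looks like from the application side, the class below addresses storage by page number and does fixed-size positional reads and writes. Everything here is an assumption for illustration: the `PageStore` name, the 4096-byte page size, and the use of an ordinary preallocated file as a stand-in for a raw device. A real engine would more likely open the block device itself (for example with `O_DIRECT` and aligned buffers, or a userspace NVMe driver), which this sketch does not attempt.

```python
import os

PAGE_SIZE = 4096  # assumed page size, not taken from any specific database


class PageStore:
    """Hypothetical page-addressed store over one preallocated file."""

    def __init__(self, path, num_pages):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        # Preallocate the address space up front, like a fixed-size device.
        os.ftruncate(self.fd, num_pages * PAGE_SIZE)
        self.num_pages = num_pages

    def _offset(self, page_no):
        if not 0 <= page_no < self.num_pages:
            raise IndexError(f"page {page_no} out of range")
        return page_no * PAGE_SIZE

    def write_page(self, page_no, payload):
        if len(payload) > PAGE_SIZE:
            raise ValueError("payload larger than one page")
        page = payload.ljust(PAGE_SIZE, b"\0")  # always write whole pages
        written = os.pwrite(self.fd, page, self._offset(page_no))
        if written != PAGE_SIZE:
            raise OSError("short write")

    def read_page(self, page_no):
        return os.pread(self.fd, PAGE_SIZE, self._offset(page_no))

    def sync(self):
        os.fsync(self.fd)

    def close(self):
        os.close(self.fd)
```

Usage is just `store.write_page(3, b"row data")` followed by `store.read_page(3)`; there are no file names, directories, or metadata in the database's hot path, only page numbers.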

Wow, this comment just made me fall down a rabbit hole. I've only just surfaced. The Kaitai project actually comes with some predefined bindings for cpio, which meant I was up and running very quickly.

https://formats.kaitai.io/cpio_old_le/index.html

Yes, cpio is ancient and simple and very easy to work with. The Kaitai project didn't exist at the time; I used the C structs documented in man pages.
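
For anyone curious how simple the format is, here is a minimal sketch of reading and writing the old binary cpio layout (the little-endian variant the Kaitai link above describes), based on the header struct in cpio(5): thirteen 16-bit fields, with mtime and filesize each split into two halves, most significant first. The helper names `pack_entry` and `iter_entries` are made up for this example, and it ignores everything a real mount would care about (modes, inodes, hard links).

```python
import struct

MAGIC = 0o070707                 # old binary cpio magic number
HEADER = struct.Struct("<13H")   # 13 little-endian 16-bit fields, 26 bytes
TRAILER = "TRAILER!!!"           # name of the end-of-archive entry


def _pad(n):
    """Names and file data are padded to an even number of bytes."""
    return n + (n & 1)


def pack_entry(name, data, mode=0o100644):
    """Build one archive entry (hypothetical helper, most fields zeroed)."""
    raw_name = name.encode() + b"\0"
    size = len(data)
    header = HEADER.pack(
        MAGIC, 0, 0, mode, 0, 0, 1, 0,   # magic, dev, ino, mode, uid, gid, nlink, rdev
        0, 0,                            # mtime (high half, low half)
        len(raw_name),                   # namesize, including the NUL
        size >> 16, size & 0xFFFF,       # filesize (high half, low half)
    )
    return (header
            + raw_name.ljust(_pad(len(raw_name)), b"\0")
            + data.ljust(_pad(size), b"\0"))


def iter_entries(blob):
    """Yield (name, data) pairs until the trailer entry is reached."""
    offset = 0
    while offset + HEADER.size <= len(blob):
        fields = HEADER.unpack_from(blob, offset)
        if fields[0] != MAGIC:
            raise ValueError(f"bad magic at offset {offset}")
        namesize = fields[10]
        filesize = (fields[11] << 16) | fields[12]
        offset += HEADER.size
        name = blob[offset:offset + namesize - 1].decode()  # drop the NUL
        offset += _pad(namesize)
        if name == TRAILER:
            return
        yield name, blob[offset:offset + filesize]
        offset += _pad(filesize)
```

Because every entry is just header, name, data, a read-only "mount" only needs one pass like `iter_entries` to build a name-to-offset index.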
You may be interested in a paper written by the Ceph team: "File Systems Unfit as Distributed Storage Backends: Lessons from 10 Years of Ceph Evolution"

https://www.pdl.cmu.edu/PDL-FTP/Storage/ceph-exp-sosp19.pdf

There are definitely some significant benefits you can get from managing your own storage, rather than using a filesystem.

Yes, this is common in database engines. Doing so allows you to optimize the file system along a very different set of performance tradeoffs and assumptions than a typical generic file system. Beyond that, it also gives you direct control of file system behavior, the lack of which is a source of code complexity and edge cases. This is not transparent to the database: something like PostgreSQL would need to have its storage layer redesigned to explicitly take advantage of the guarantees.

It isn't just about performance gains, which are substantial; it also greatly simplifies the design and code by eliminating edge cases, undesirable behaviors, and variability in behavior across different deployment environments.