Ask HN: What if we use file system as database? | HN Mirror


Ask HN: What if we use file system as database?

11 points by xyheme 990 days ago

For example:

users/{user}/index.json

users/{user}/projects/{project}/index.json
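A minimal sketch of how records under this layout could be read and written, assuming one JSON document per `index.json`. The helper names (`record_path`, `write_record`, `read_record`) are hypothetical and not from the post; the write goes through a temporary file and a rename so that readers never see a half-written record.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def record_path(root: Path, user: str, project: Optional[str] = None) -> Path:
    """Map a user (and optionally a project) to its index.json path."""
    base = root / "users" / user
    if project is not None:
        base = base / "projects" / project
    return base / "index.json"


def write_record(path: Path, data: dict) -> None:
    """Write JSON to a temp file in the same directory, then rename it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)
        f.flush()
        os.fsync(f.fileno())
    # The temp file lives next to the target, so this is a same-filesystem rename.
    os.replace(tmp, path)


def read_record(path: Path) -> dict:
    """Load one record from its index.json file."""
    return json.loads(path.read_text(encoding="utf-8"))
```

Usage would look like `write_record(record_path(root, "alice", "demo"), {"name": "demo"})`, where `root` is whatever directory holds the data.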

14 comments

nicbou 990 days ago

I'm going back to the file system as a source of truth with static site generators.

The main reason is that I can use a variety of tools to view and manipulate files.

But other software can also mess with your "database", including the operating system itself. For example, Syncthing was breaking my static site by changing the Unicode normalisation of file names. macOS being case-insensitive brought issues. Illegal filename characters brought issues.
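One way to catch the normalisation and case problems described above before they bite is to check names against a canonical form. This is only a sketch: the functions `canonical_name` and `find_collisions` are hypothetical, and NFC plus case folding is an assumed policy, not what any particular tool or operating system does.

```python
import unicodedata
from collections import defaultdict


def canonical_name(name: str) -> str:
    """Reduce a file name to one comparison form: NFC-normalised, then case-folded."""
    return unicodedata.normalize("NFC", name).casefold()


def find_collisions(names):
    """Group names that differ only by Unicode normalisation or letter case."""
    groups = defaultdict(list)
    for name in names:
        groups[canonical_name(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]
```

Running `find_collisions` over a directory listing before committing or syncing would flag pairs such as `Readme.md` and `README.md`, or a precomposed and a decomposed spelling of the same accented name.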

This approach works well if you want to make data accessible to humans, but it's wildly inefficient if you expect machines to operate on that data, and comes with a few caveats.

Pick the right tool for the job.

xyheme 990 days ago

> The main reason is that I can use a variety of tools to view and manipulate files.

This is also the reason I am researching this old "What if" problem.

Because I think that, compared to some version of MySQL or PostgreSQL, files are stable, open and timeless.

nicbou 989 days ago

This is exactly why I chose that approach for my websites. My website content and its backups are source-controlled, human-readable files. I can edit them with the tools I love, not just what's supplied with the content management system.

I'm now rebuilding my timeline thing [0] with the filesystem as the database. However, I still use an SQLite database as intermediate storage, because extracting metadata from thousands of photos is not cheap.

In other words, you'll need to build your own cache, and sync it with your filesystem. Making data human-readable makes it slower for machines to read.
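A rough sketch of such a cache, assuming SQLite as the intermediate store and using file modification time plus size to decide what is stale. The table layout and the functions `open_cache`, `extract_metadata`, and `sync` are hypothetical; `extract_metadata` stands in for whatever expensive step (such as reading photo metadata) the real project performs.

```python
import json
import sqlite3
from pathlib import Path


def open_cache(db_path=":memory:"):
    """Open (or create) the cache database."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS cache ("
        " path TEXT PRIMARY KEY, mtime_ns INTEGER, size INTEGER, meta TEXT)"
    )
    return db


def extract_metadata(path: Path) -> dict:
    """Placeholder for the expensive extraction step."""
    return {"bytes": path.stat().st_size}


def sync(db, root: Path) -> int:
    """Re-extract only new or changed files and drop rows for deleted ones.

    Returns the number of files that had to be re-extracted.
    """
    seen, refreshed = set(), 0
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        st = path.stat()
        key = str(path)
        seen.add(key)
        row = db.execute(
            "SELECT mtime_ns, size FROM cache WHERE path = ?", (key,)
        ).fetchone()
        if row == (st.st_mtime_ns, st.st_size):
            continue  # cached entry still matches the file on disk
        meta = json.dumps(extract_metadata(path))
        db.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?, ?, ?)",
            (key, st.st_mtime_ns, st.st_size, meta),
        )
        refreshed += 1
    cached_paths = [r[0] for r in db.execute("SELECT path FROM cache")]
    for key in cached_paths:
        if key not in seen:
            db.execute("DELETE FROM cache WHERE path = ?", (key,))
    db.commit()
    return refreshed
```

The files stay the source of truth; the database can be deleted at any time and rebuilt by calling `sync` again.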

[0] https://nicolasbouliane.com/projects/timeline

VoodooJuJu 990 days ago

For simple apps and app components, it's very convenient and manageable.

It becomes a problem when you: (1) scale up (2) have to deal with multiple relationships between objects. The "Database design" by Adrienne Watt posted in another comment covers the scale concerns well, but another scale problem she doesn't mention is hitting inode limit, at least if you're on a single machine. You can of course use a distributed filesystem as database, but at that point, you might want to use a database proper.

xyheme 990 days ago

Inode limits depend on the file system you use.

I am using Btrfs.

Btrfs inode limits are in a whole different league: whereas ext4's inodes are allocated at filesystem creation time and cannot be resized after creation, typically at 1-2 million, with a hard limit of 4 billion, btrfs's inodes are dynamically allocated as needed, and the hard limit is 2^64, around 18.4 quintillion.

> -- https://unix.stackexchange.com/questions/18388/what-are-the-...
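To see where a given volume stands, the inode counters can be read with `os.statvfs` (Unix only). This is a small sketch; the function name `inode_usage` is made up here, and the assumption that a filesystem with dynamically allocated inodes may report a total of 0 should be checked against the filesystem in question.

```python
import os


def inode_usage(path="."):
    """Return (total, free) inode counts as reported by statvfs for `path`."""
    st = os.statvfs(path)
    return st.f_files, st.f_ffree


total, free = inode_usage("/")
if total == 0:
    # Assumption: no fixed inode table, so the counter is not meaningful here.
    print("filesystem does not report a fixed inode count")
else:
    print(f"{total - free} of {total} inodes in use")
```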

warrenm 990 days ago

I didn't realize btrfs was still around

I haven't seen anything but ext4 or xfs in over a decade

xyheme 990 days ago

A lot of my Linux-using friends and I are using btrfs.

warrenm 990 days ago

cool

nice to know it still exists :)

jqpabc123 989 days ago

> ... hitting inode limit

Use a better file system with the ability to "scale up".

Attempting to predict and limit the max. number of allowed files at the time of creation is an unbelievably audacious yet hamstrung and self limiting design --- one that is totally unnecessary and as you point out, doomed to fail at some point --- often before storage and address space are depleted.

I find this particularly egregious in an era of constantly increasing storage demands, changing volume capacities and drive pooling in an OS often promoted for its server prowess and flexibility.

torunar 990 days ago

This chapter from "Database design" by Adrienne Watt has a very good explanation why this approach didn't stick: https://opentextbc.ca/dbdesign01/chapter/chapter-1-before-th...

jqpabc123 990 days ago

All she did was reiterate why large enterprises migrated toward the RDBMS.

A simple web service designed to address a single problem is not a "large enterprise". And applying an RDBMS to such problems is like using a hand grenade to kill a fly. It will work --- but there are better, more efficient and appropriate approaches.

In other words, not every problem needs to be generalized to encompass the global economy.

xyheme 990 days ago

@jqpabc123

I agree.

We (developers) should also design tools to address the needs of small-scale web apps.

Maybe we do not even need general tools to use the file system as a database; we just need to talk about this method more often.

xyheme 990 days ago

Thanks for the link.

I read it, and I think it describes bad ways of using file system, but there are also good ways.

Just like there are bad ways of using SQL databases, and there are also NoSQL, and bad ways of using NoSQL databases.

eimrine 989 days ago

Thank you for the beginner SQL book.

jryan49 990 days ago

The reason not to is that you want ACID guarantees (i.e., what if two people are editing a file at once), the ability to scale, or a query language.

xyheme 990 days ago

All of that can be achieved by using the file system as a database.

Not even complicated to implement.

See CouchDB for how to handle "what if two people are editing a file at once".
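As a sketch of what that could look like on plain files, loosely modelled on CouchDB's revision check (an update must name the revision it was based on, otherwise it is rejected as a conflict): each revision is stored as its own file, and a hard link is used to claim the next revision number, since `os.link` fails if the target already exists. The layout (`<doc_dir>/<rev>.json`) and all names here are hypothetical, and it assumes a filesystem that supports hard links.

```python
import json
import os
import tempfile
from pathlib import Path


class Conflict(Exception):
    """Raised when the caller's base revision is no longer the latest one."""


def latest_rev(doc_dir: Path) -> int:
    """Return the highest stored revision number, or 0 if there is none."""
    if not doc_dir.is_dir():
        return 0
    return max((int(p.stem) for p in doc_dir.glob("*.json")), default=0)


def read_doc(doc_dir: Path):
    """Return (revision, document) for the latest revision, or (0, None)."""
    rev = latest_rev(doc_dir)
    if rev == 0:
        return 0, None
    text = (doc_dir / f"{rev}.json").read_text(encoding="utf-8")
    return rev, json.loads(text)


def write_doc(doc_dir: Path, data: dict, base_rev: int) -> int:
    """Store `data` as revision base_rev + 1, or raise Conflict."""
    doc_dir.mkdir(parents=True, exist_ok=True)
    if latest_rev(doc_dir) != base_rev:
        raise Conflict(f"latest revision is not {base_rev}")
    target = doc_dir / f"{base_rev + 1}.json"
    fd, tmp = tempfile.mkstemp(dir=doc_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f)
        # The real guard: linking fails if another writer already took this revision.
        os.link(tmp, target)
    except FileExistsError:
        raise Conflict(f"revision {base_rev + 1} already exists") from None
    finally:
        os.unlink(tmp)
    return base_rev + 1
```

A writer that loses the race gets `Conflict`, re-reads the document, and decides how to merge. Old revisions pile up in this sketch and would need separate cleanup.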

jryan49 990 days ago

You're not getting that just from the file system though. Depending on the performance requirements and use case it may or may not be complicated. You are going down the rabbit hole of creating your own foundation for a database at that point.

Daeraxa 990 days ago

I've been using a file based database at work for years. Data is stored in 1024 byte "blocks" with each file containing 1000 blocks. Each block refers to a single record and the data is delimited by the positions within those 1024 bytes (e.g. date/time created might be at position 38 in the block and run for 14 bytes so we reference it by its "token" of 380014). Each file will be the first 4 characters of the 8 character long primary IDs of each block. E.g. file 1020.dat would contain records 10200000 -> 10209999. We then have all kinds of other files used for indexing and data locations that are built by tools over those files as well as overflow data files when those 1024 bytes just weren't enough.

I should point out this is a legacy system and for very good reasons we moved to an actual database a long time ago.
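For readers unfamiliar with this style of storage, here is a rough sketch of how such a lookup might work. It is a guess at the scheme, not the actual system: it assumes the last four digits of a token are the field length and the rest is a zero-based position, and that the last four characters of an ID select the block within the file. All function names are hypothetical.

```python
import os

BLOCK_SIZE = 1024  # bytes per record, as described above


def parse_token(token: str):
    """Split a token such as "380014" into (position, length).

    Assumption: the final four digits are the length.
    """
    return int(token[:-4]), int(token[-4:])


def locate(record_id: str):
    """Map an 8-character ID to (file name, byte offset of its block)."""
    return f"{record_id[:4]}.dat", int(record_id[4:]) * BLOCK_SIZE


def read_field(data_dir: str, record_id: str, token: str) -> bytes:
    """Read one field of one record by seeking straight to its bytes."""
    file_name, block_start = locate(record_id)
    position, length = parse_token(token)
    with open(os.path.join(data_dir, file_name), "rb") as f:
        f.seek(block_start + position)
        return f.read(length)
```

The appeal of the design is that a lookup is a single seek with no parsing; the cost is that every field width is fixed up front, which is where the overflow files come in.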

xyheme 990 days ago

Why not use one JSON file as one record of data, and use pathname as primary ID?

Daeraxa 990 days ago

Because it predates JSON by over a decade?

xyheme 990 days ago

I should have thought about that.

duped 990 days ago

In your example you're kind of mixing a DB within a DB (the index.json file is a separate database contained within the main database).

A better structure would be something like

    users/<user>/projects/<name>/data/... etc

Now your file system is just a NoSQL database. All that data you would dump into an index.json can be stored in the file tree.

That would actually work pretty well as long as you limit operations that are kind of meaningless and disable features like symlinks.

Mounting hierarchical data as a pseudo file system is actually pretty common. E.g., procfs, devfs, and sysfs are all pseudo file systems that present structured data to applications on Linux through the guise of a file tree.
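The layout suggested in this comment, with every property stored as its own file instead of inside an `index.json`, could be sketched like this. The functions `put`, `get`, and `load_tree` are hypothetical; values are kept as plain strings, and symlinks are skipped in line with the suggestion to disable them.

```python
from pathlib import Path


def put(root: Path, key: str, value: str) -> None:
    """Store one value at a slash-separated key, creating directories as needed."""
    path = root / key
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(value, encoding="utf-8")


def get(root: Path, key: str) -> str:
    """Read the value stored at a slash-separated key."""
    return (root / key).read_text(encoding="utf-8")


def load_tree(path: Path):
    """Recursively read a directory into a nested dict of strings."""
    if path.is_file():
        return path.read_text(encoding="utf-8")
    return {
        child.name: load_tree(child)
        for child in sorted(path.iterdir())
        if not child.is_symlink()
    }
```

For example, `put(root, "users/alice/projects/demo/data/title", "Demo")` creates the whole directory chain, and `load_tree(root)` gives back the same data as a nested dictionary.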

xyheme 990 days ago

JSON is there for humans to read a record of data by group.

I think it is only good to use directory as data table to store JSON data files, but not for JSON data properties.

duped 990 days ago

> JSON is there for humans to read a record of data by group.

That's what tree is for

> I think it is only good to use directory as data table to store JSON data files, but not for JSON data properties.

I see that as no better than one big JSON file or a normal NoSQL database.

If you're really fancy then the whole DB is just a VFS and those JSON files aren't real files, just a serialized form of the DB.

xyheme 990 days ago

directory tree is not as easy to view and edit as JSON.

One big JSON file is harder.

I am not fancy, and the aim is to simply use file system as database.

Not fancy stuff like "database as file system as database".

duped 990 days ago

> directory tree is not as easy to view and edit as JSON.

    tree path/to
    cat path/to/key
    echo "1" > path/to/key

But my point is that you're not using the file system as a database. You're using it as an index, and haven't considered multiple readers/writers to those individual JSON files that are doing the real work as databases. It's kind of like writing JSON into a SQL table. You can do it, but probably not to store important data within that JSON that always needs to be queried and ser/deserialized for any kind of read or write. If you need that, you probably want NoSQL.

belter 990 days ago

Well you can use a database as the file system...That is what mobiles do. They use SQLite.

ilaksh 990 days ago

For many years, I built most projects on top of a relational DB.

Then NoSQL happened.

After NoSQL became less popular, I started defaulting to JSON files, if I thought I could make it work.

One thing that helps a file-based DB is to make sure you put something useful in the filename (and/or path). Such as a search tag and/or category or owner.
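A small sketch of why this helps: if the owner and tag are part of the file name, a lookup can be answered from the directory listing alone, without opening any file. The `owner--tag--id.json` naming scheme and the functions below are assumptions made for illustration, not a convention from the comment.

```python
from pathlib import Path


def record_name(owner: str, tag: str, record_id: str) -> str:
    """Build a file name that carries the searchable fields."""
    return f"{owner}--{tag}--{record_id}.json"


def find(root: Path, owner: str = "*", tag: str = "*"):
    """List matching record names using only the names; no file is opened."""
    return sorted(p.name for p in root.glob(f"{owner}--{tag}--*.json"))
```

For instance, `find(root, tag="invoice")` returns every invoice regardless of owner. Anything not encoded in the name still requires reading the files.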

I suspect that the majority of programs are like the ones I write in that they don't have a ton of demands on them, relatively speaking. Not in terms of load or scope or anything else.

In most of these "small" applications, I think you can literally use almost _any_ database or file structure you want and end up with basically equivalent results.

xyheme 990 days ago

Thanks for the reminder about meaningful pathnames :)

badpun 990 days ago

For one, with a 1 file per row (entity) design, any larger query would require reading from potentially millions of files, which I imagine would be nightmarishly inefficient.

xyheme 990 days ago

I think, unless column-oriented database is used, "one file per row" is similar to normal row-oriented database.

Maybe there are some benchmarks in the past.

jqpabc123 990 days ago

I have done this for simple web apps in the past.

And I have had programmers who used the resulting apps call me a liar and tell me it's not possible.

The file system is itself a multi-user indexed data store built right into the OS kernel. However, it obviously is not a relational DBMS.

So if your needs are limited and can be satisfied by a simple indexed data store, this can work remarkably well.

Even though some "software engineers" will tell you it can't because they have been trained differently.

romanhn 990 days ago

Reminds me of WinFS, the Windows filesystem built on top of a relational database [0] that was supposed to ship with Longhorn (aka Windows Vista). Neat idea that never materialized.

[0] https://en.m.wikipedia.org/wiki/WinFS

jqpabc123 990 days ago

MS NTFS has always used B+trees to index files.

A full relational database for a file system is another example of the DB overkill that is common in the industry.

Teach people to use a hammer and every problem becomes a nail.

Apparently, more rational heads at MS eventually prevailed.

xyheme 990 days ago

Isn't WinFS the reverse of "file system as database"?

Which is "database as file system".

revskill 989 days ago

It's fine as long as the OS gives you an "API" for searching, indexing, and garbage collection programmatically.

theandrewbailey 990 days ago

As long as you don't run into inode or file count limits, a filesystem is a valid way to structure data like this.

pestatije 990 days ago

why index.json? you already have the file system's directory list of files

xyheme 990 days ago

So that I can use subdirectories to model a "has many" relationship, like "a user has many projects".
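Under that layout, following the "has many" relationship amounts to listing a subdirectory. A minimal sketch, assuming the `users/{user}/projects/{project}/index.json` paths from the original post; the function name `list_projects` is hypothetical.

```python
from pathlib import Path


def list_projects(root: Path, user: str):
    """Return the names of all projects that have an index.json under this user."""
    projects_dir = root / "users" / user / "projects"
    if not projects_dir.is_dir():
        return []
    return sorted(p.parent.name for p in projects_dir.glob("*/index.json"))
```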