by dmw_ng 1540 days ago
> database is an append-only file of JSON objects separated by newlines. When the app restarts, it reads the file and rebuilds its memory image. All data is in RAM

Apps like this tend to perform like an absolute whippet too (or if they don't, getting them to perform well is often a 5-line change). It's really freeing to be able to write scans and filters with simple loops that still return results faster than a network roundtrip to a database.

The problem is always growth, either GC jank from a massive heap, running out of RAM, or those loops eventually catching up with you. Fixing any one of these eventually involves either serialization or IO, at which point the balance is destroyed and a real database wins again.
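
For anyone who wants to see the shape of the design quoted at the top, here is a minimal sketch. It is in Python purely for brevity (the thread is about Node), and everything in it is an assumption for illustration: the class name JsonLogStore, the put/scan methods, and the idea that every record carries an "id" field. It is not the original author's code.

```python
import json
import os
import tempfile


class JsonLogStore:
    """In-memory dict backed by an append-only file of newline-delimited JSON."""

    def __init__(self, path):
        self.path = path
        self.records = {}  # the "memory image": id -> latest record
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                for line in f:
                    if line.strip():
                        rec = json.loads(line)
                        self.records[rec["id"]] = rec  # later lines win

    def put(self, rec):
        # Append to the log first, then update RAM, so the file never lags the image.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(rec) + "\n")
        self.records[rec["id"]] = rec

    def scan(self, predicate):
        # A plain loop over RAM: no query planner, no network round trip.
        return [r for r in self.records.values() if predicate(r)]


path = os.path.join(tempfile.mkdtemp(), "db.jsonl")
db = JsonLogStore(path)
db.put({"id": 1, "name": "amy", "plan": "pro"})
db.put({"id": 2, "name": "bob", "plan": "free"})
db.put({"id": 1, "name": "amy", "plan": "free"})  # an update is just another append

reopened = JsonLogStore(path)  # simulate a restart: replay the log into RAM
free_users = reopened.scan(lambda r: r["plan"] == "free")
print(sorted(r["name"] for r in free_users))
```

The growth problems described above show up directly here: the log only ever gets longer, startup replays all of it, and scan is linear in the number of records.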


Another issue with "just a JSON file" as a database is that you need to be a bit careful to avoid race conditions and the like, e.g. if two web pages try to write to the same database at the same time. It's not an issue for all applications, and not that hard to get right, but it does require some effort. This is a huge reason I prefer SQLite for simple file storage needs.
A normal Express app (assuming it's one process per JSON file) shouldn't have that problem, because JavaScript is single-threaded.
It can definitely be a problem in Node.js. Assuming the workflow is read from disk -> modify -> write to disk, and that you're using the async fs functions, two async code paths running at the same time will have last-write-wins semantics and will lose data.

That's the naive scenario. If all code paths write out a global data structure, then it'd be fine. Or if the file is written append-only instead of as a single, atomic data structure, then it could be fine.
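
To make the two cases above concrete, here is a small simulation. It uses Python's asyncio as a stand-in for Node's event loop (both run callbacks on a single thread), and a dict as a stand-in for the disk so the result is deterministic. The names read_file, write_file, naive_increment, and safe_increment are made up for this sketch.

```python
import asyncio
import json

# Stand-in for the file system, so the interleaving is repeatable.
disk = {"db.json": json.dumps({"count": 0})}


async def read_file(name):
    await asyncio.sleep(0)  # yield to the event loop, as real async I/O would
    return disk[name]


async def write_file(name, text):
    await asyncio.sleep(0)  # yield again before the write lands
    disk[name] = text


async def naive_increment():
    data = json.loads(await read_file("db.json"))  # read from disk
    data["count"] += 1                             # modify a private copy
    await write_file("db.json", json.dumps(data))  # write back: last write wins


state = {"count": 0}  # single in-memory source of truth


async def safe_increment():
    state["count"] += 1  # no await between read and modify, so nothing can interleave
    await write_file("db.json", json.dumps(state))  # snapshot is taken at call time


async def main():
    await asyncio.gather(naive_increment(), naive_increment())
    naive = json.loads(disk["db.json"])["count"]

    disk["db.json"] = json.dumps({"count": 0})
    await asyncio.gather(safe_increment(), safe_increment())
    safe = json.loads(disk["db.json"])["count"]
    return naive, safe


naive_count, safe_count = asyncio.run(main())
print("naive:", naive_count, "safe:", safe_count)
```

Both naive tasks read the old value before either writes, so one increment is lost even though only one thread ever runs. The global-structure version never loses the update in memory; the file just mirrors whatever snapshot was written last.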

You are confusing parallelism with concurrency. It definitely can be a problem.
Is it possible a write is interrupted on its turn in the event loop, and crossed with another?
Hmm. I wouldn't think so, but I don't actually know.

Still, given the strategy at hand, the in-memory JS object (exclusively single-threaded) is the source of truth, and just gets mirrored in the file system (and doesn't get read again until the next startup). So you should have an eventual-consistency situation in the worst case (any racing issue between file-writes would just put the file in a stale state, and the next file-write would bring it back up to consistency)

Doesn't the fact that it's opened in append-only mode (Linux) mitigate data races with regard to writes?
Your write will be fine; that is, it's not as if data from one write will be interspersed with the data from another write. It's just that the order might be wrong, or opening the file multiple times (possibly from multiple processes) could be fun too. The program or computer crashing mid-write can also cause problems. Things like that.

Again, may not be an issue at all for loads of applications. But I used a lot of "flat file databases" in the past, and found it's not an issue right up to the point that it is. Overall, I found SQLite simple, fast, and ubiquitous enough to serve as a good fopen() replacement. In some cases it can even be faster!
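
For the crash-mid-write case mentioned above, one common mitigation with newline-delimited logs is to treat an incomplete or undecodable trailing line as a torn write and drop it on startup. The sketch below assumes one JSON object per line; load_jsonl_tolerant is a hypothetical helper, not something from the original app, and it does nothing about ordering or multiple writers.

```python
import json
import os
import tempfile


def load_jsonl_tolerant(path):
    """Return (records, torn); torn is True if loading stopped at a damaged line."""
    records = []
    torn = False
    with open(path, "rb") as f:
        for raw in f:
            if not raw.strip():
                continue  # ignore blank lines
            if not raw.endswith(b"\n"):
                torn = True  # the final write never completed; ignore it
                break
            try:
                records.append(json.loads(raw))
            except ValueError:  # covers JSONDecodeError and bad UTF-8
                torn = True
                break
    return records, torn


# Simulate a crash halfway through the third append.
path = os.path.join(tempfile.mkdtemp(), "db.jsonl")
with open(path, "wb") as f:
    f.write(b'{"id": 1}\n{"id": 2}\n{"id": 3, "na')

records, torn = load_jsonl_tolerant(path)
print(records, torn)
```

This is roughly the kind of bookkeeping SQLite's journal does for you, which is part of the argument for using it as an fopen() replacement.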

> Your write will be fine; that is, it's not as if data from one write will be interspersed with the data from another write.

Are you sure? I thought it could be: if the first write has more data than the kernel/fs-driver buffer can hold, only part of it gets written, and the rest can then be interrupted when another thread calls write() with a small buffer that gets written in one go.

No, I'm not sure, haha. In my experience it usually works like that, but no doubt there could be edge cases there, too. Another good reason to use SQLite.
Here is my list of numbers: 1,Here is my list of letters: a,b,2,3,d
Although not a POSIX requirement, in practice for unix-like systems, file writes are atomic across concurrent writers.

You may be thinking of stdio buffering, where calls to printf etc. get split into multiple write calls. In those cases, it's possible to get errant interleaved writes.

It eliminates them if they're smaller than PIPE_BUF (IIRC, Beltalowda, dmoy, and stevenhuang are wrong about this), but the thing that prevents data races with regard to writes is running the application in Node, which is completely single-threaded.
> The problem is always growth, either GC jank from a massive heap, running out of RAM, or those loops eventually catching up with you

Absolutely. The challenge is having enough faith that it will take long enough to catch up to you.

Statistically speaking, it won't catch up to you, and if it does, it will take so long that you should have seen it coming from miles away and had time to prepare.

In my systems that use an in-memory/append-only technique, I try to keep only the pointers and basic indexes in memory. With modern PCIe flash storage, there is no good justification for keeping big fat blobs around in memory anymore.

Could you expand on what you mean by keeping pointers and basic indexes in memory?
Pointers are tuples of (Id, LogOffset) and are used to map logical identities to positions of those objects in the append-only log.

Indexes are usually a tuple of (Some64BitKey, Id) and are used to map physical business keys to logical object identities. These entries are only candidates in the case where the key material needs to be hashed and inspected for actual equivalence.

One big advantage with this approach is that you can stream big blobs directly out of the log to a caller-supplied buffer or stream. No intermediate allocations required aside from some small buffers.
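
Here is a rough sketch of how the pointer and index tuples described above could fit together. It is not the commenter's implementation: the record layout (two length fields, then key, then blob), the choice of BLAKE2b truncated to 64 bits as the key hash, and the names BlobLog, key64, append, and stream are all assumptions made for illustration. Rebuilding the maps by scanning the log at startup is left out.

```python
import hashlib
import io
import os
import struct
import tempfile

HEADER = struct.Struct(">II")  # key length, blob length (hypothetical layout)


def key64(key: bytes) -> int:
    # 64-bit hash of the business key; collisions are possible by design.
    return int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big")


class BlobLog:
    def __init__(self, path):
        self.f = open(path, "a+b")  # reads anywhere, writes always append
        self.pointers = {}          # Id -> LogOffset
        self.index = {}             # 64-bit key hash -> [Id, ...] (candidates only)
        self.next_id = 1

    def append(self, key: bytes, blob: bytes) -> int:
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(HEADER.pack(len(key), len(blob)) + key + blob)
        self.f.flush()
        obj_id = self.next_id
        self.next_id += 1
        self.pointers[obj_id] = offset
        self.index.setdefault(key64(key), []).append(obj_id)
        return obj_id

    def stream(self, key: bytes, out, chunk_size=64 * 1024) -> bool:
        """Copy the blob for `key` from the log into the caller-supplied stream."""
        for obj_id in self.index.get(key64(key), []):
            self.f.seek(self.pointers[obj_id])
            key_len, blob_len = HEADER.unpack(self.f.read(HEADER.size))
            if self.f.read(key_len) != key:
                continue  # hash matched but the real key did not: try next candidate
            remaining = blob_len
            while remaining:
                chunk = self.f.read(min(chunk_size, remaining))
                if not chunk:
                    raise IOError("log truncated inside a blob")
                out.write(chunk)
                remaining -= len(chunk)
            return True
        return False


log = BlobLog(os.path.join(tempfile.mkdtemp(), "blobs.log"))
log.append(b"invoice-1", b"A" * 100_000)
log.append(b"invoice-2", b"hello")

buf = io.BytesIO()
found = log.stream(b"invoice-2", buf)
print(found, buf.getvalue(), log.pointers)
```

Only the two small dicts live in RAM; the blobs stay in the log and are copied out in fixed-size chunks, which is the "no intermediate allocations" point made above.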

Awesome, thank you for sharing!
Yes, you need to be sure that you understand the growth pattern if you want to YOLO in RAM. If your product aims to be the next Instagram, this is clearly not the architecture.

But a lot of small businesses are genuinely small. They may not sign up new customers that often. When they do, the impact to the service is often very predictable ("Amy at customer X uses this every other day, she's very happy, it generates 100 requests / week"). If growth picks up, there would be signs well in advance of the toy service becoming an actual problem.

> If your product aims to be the next Instagram, this is clearly not the architecture.

But maybe! https://instagram-engineering.com/dismissing-python-garbage-...