by cperciva 1979 days ago
There's an even better reason for databases not to write to memory-mapped pages: pages get synced out to disk at the kernel's leisure. This can be OK for a cache, but it's definitely not what you want for a database!

That's what msync() is for.
If you're tracking what needs to be flushed to disk when, you might as well just be making explicit pwrite syscalls.
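To make the comparison above concrete, here is a minimal Python sketch of both routes on a Unix-like system: a store into a shared mapping followed by msync() on just the touched range (Python exposes msync() as `mmap.flush`), and an explicit pwrite() followed by fsync(). The helper names, file name, and offsets are made up for illustration, and short writes are not handled.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def update_via_mmap(fd, offset, data):
    """Store into a shared mapping, then msync() only the touched range."""
    size = os.fstat(fd).st_size
    with mmap.mmap(fd, size, mmap.MAP_SHARED,
                   mmap.PROT_READ | mmap.PROT_WRITE) as m:
        m[offset:offset + len(data)] = data
        # msync() wants a page-aligned start address, so round down.
        start = offset - (offset % PAGE)
        m.flush(start, offset + len(data) - start)


def update_via_pwrite(fd, offset, data):
    """Explicit positional write, then ask the kernel to flush the file."""
    os.pwrite(fd, data, offset)  # sketch: assumes the write is not short
    os.fsync(fd)


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "db.bin")  # hypothetical file name
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            os.ftruncate(fd, 2 * PAGE)
            update_via_mmap(fd, PAGE + 10, b"mmap")
            update_via_pwrite(fd, 20, b"pwrite")
            return os.pread(fd, 4, PAGE + 10), os.pread(fd, 6, 20)
        finally:
            os.close(fd)
```

Either way the caller has to know which byte range is dirty, which is the point being made: once you track that, the mapping buys you little over pwrite. Note that msync() only forces a flush; it does not stop the kernel from writing those pages back earlier.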
Right, but the kernel can still write back arbitrary ranges sooner than you ask, which is also awful for consistency.
Shouldn't your write strategy be resilient to that kind of thing (e.g. shutdown during a partial update)?
Don't you need exact guarantees on write ordering to achieve that?
Yes, for almost all databases, although there was a cool paper from the University of Wisconsin-Madison a few years ago that showed how to design something that could work without write barriers, and under the assumption that disks don't always fsync correctly:

"the No-Order File System (NoFS), a simple, lightweight file system that employs a novel technique called backpointer based consistency to provide crash consistency without ordering writes as they go to disk"

http://pages.cs.wisc.edu/~vijayc/nofs.htm

Does that generalize to databases? My understanding is that file systems are a restricted case of databases that don’t necessarily support all operations (e.g. transactions are smaller, you can’t do arbitrary queries within a transaction, etc.).
You can do write/sync/write/sync in order to achieve that. It would be nicer to have FUA (Force Unit Access) support in system calls (or you can open the same file on two descriptors, one with O_SYNC and one without).
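A rough Python sketch of the two approaches just mentioned, for a Unix-like system. The file layout (a one-byte commit flag at offset 0, the record body at offset 16, a scratch area at offset 64) and the function names are assumptions made up for this example. How the O_SYNC descriptor gets used is one interpretation of the comment, not something it spells out. Durability is only as good as the kernel and the device make it.

```python
import os
import tempfile

MARKER_OFF = 0    # hypothetical layout: 1-byte commit flag
PAYLOAD_OFF = 16  # hypothetical layout: record body
SCRATCH_OFF = 64  # hypothetical layout: data with no ordering requirement


def commit_write_sync(fd, payload):
    """write/sync/write/sync: the flag is written only after the payload
    has been flushed, so a crash cannot leave a flag without its payload."""
    os.pwrite(fd, payload, PAYLOAD_OFF)
    os.fsync(fd)  # barrier 1: payload flushed before the flag is written
    os.pwrite(fd, b"\x01", MARKER_OFF)
    os.fsync(fd)  # barrier 2: flag flushed before we report success


def commit_two_descriptors(path, payload):
    """Same file on two descriptors: O_SYNC for order-sensitive writes,
    a plain one for everything else."""
    plain = os.open(path, os.O_RDWR)
    synced = os.open(path, os.O_RDWR | os.O_SYNC)
    try:
        os.pwrite(plain, b"scratch", SCRATCH_OFF)  # buffered, unordered
        os.pwrite(synced, payload, PAYLOAD_OFF)    # returns once flushed
        os.pwrite(synced, b"\x01", MARKER_OFF)     # issued after payload
    finally:
        os.close(synced)
        os.close(plain)


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "log.bin")  # hypothetical file name
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            commit_write_sync(fd, b"record-1")
            first = (os.pread(fd, 1, MARKER_OFF),
                     os.pread(fd, 8, PAYLOAD_OFF))
            commit_two_descriptors(path, b"record-2")
            second = (os.pread(fd, 1, MARKER_OFF),
                      os.pread(fd, 8, PAYLOAD_OFF))
            return first, second
        finally:
            os.close(fd)
```

The O_SYNC variant behaves roughly like a per-write FUA: each write on that descriptor is its own barrier, while writes on the plain descriptor stay cheap and unordered.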
I think you mean mlock