by mightyham
36 days ago
Unless I am mistaken, it seems like there is a glaring flaw in this scheme, which is that without fsync you cannot guarantee the previous WAL blocks have been persisted before the current one, so a power loss event could leave a hole in the log and cause erroneous recovery. I believe that SSDs reorder writes internally so even having atomic batched O_DIRECT is not a strong enough guarantee for durability. I'll admit that I could be misunderstanding something about the system that alleviates this concern.
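To make the "hole in the log" concern concrete, here is a minimal in-memory sketch. The block layout (sequence number plus CRC32), `BLOCK_SIZE`, and the helper names are all hypothetical and are not taken from the scheme under discussion. It only models what recovery sees if a later block reached the disk while an earlier one did not.

```python
import struct
import zlib

BLOCK_SIZE = 64                  # hypothetical fixed WAL block size
HEADER = struct.Struct("<QI")    # hypothetical header: sequence number, CRC32


def make_block(seq, payload):
    """Build one fixed-size block: header followed by a zero-padded payload."""
    body = payload.ljust(BLOCK_SIZE - HEADER.size, b"\0")
    crc = zlib.crc32(struct.pack("<Q", seq) + body)
    return HEADER.pack(seq, crc) + body


def simulate_power_loss(persisted, total=4):
    """Return a zeroed, preallocated log in which only `persisted` blocks landed."""
    log = bytearray(total * BLOCK_SIZE)
    for seq in persisted:
        start = seq * BLOCK_SIZE
        log[start:start + BLOCK_SIZE] = make_block(seq, b"record %d" % seq)
    return bytes(log)


def recover(log):
    """Replay blocks in order, stopping at the first missing or corrupt one."""
    recovered = []
    expected = 0
    for off in range(0, len(log) - BLOCK_SIZE + 1, BLOCK_SIZE):
        block = log[off:off + BLOCK_SIZE]
        seq, crc = HEADER.unpack_from(block)
        body = block[HEADER.size:]
        if seq != expected or crc != zlib.crc32(struct.pack("<Q", seq) + body):
            break                # hole or torn block: nothing after it is trusted
        recovered.append(seq)
        expected += 1
    return recovered
```

Calling `recover(simulate_power_loss({0, 2}))` returns `[0]`: block 2 is physically on disk, but it sits behind the hole left by block 1, so a careful recovery discards it. If the writer had already acknowledged block 2 as durable, that acknowledgement was wrong, and a recovery routine that skipped the sequence check would replay it out of order.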
|
My guess is the preallocation + zeroing is what got them most of the win, and the O_DIRECT is actually hurting, not helping throughput. This has been the case 100% of the time I've benchmarked such things.
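For readers unfamiliar with the "preallocation + zeroing" step, a rough sketch of it follows. The function name, chunk size, and file mode are assumptions for illustration only. The idea, as I understand it, is that writing zeros over the preallocated extents up front lets later log writes be plain overwrites of already-written blocks instead of triggering allocation or extent-conversion metadata updates.

```python
import os


def preallocate_zeroed(path, size, chunk=1 << 20):
    """Create `path`, reserve `size` bytes, overwrite them with zeros, and fsync.

    Returns an open file descriptor; the caller is responsible for closing it.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Reserve the space so later writes cannot fail with ENOSPC.
        os.posix_fallocate(fd, 0, size)
        zeros = bytes(chunk)
        offset = 0
        while offset < size:
            n = min(chunk, size - offset)
            # pwrite may write fewer bytes than requested; advance by what it reports.
            offset += os.pwrite(fd, zeros[:n], offset)
        # Persist both the zeroed data and the file's size and extent metadata.
        os.fsync(fd)
    except BaseException:
        os.close(fd)
        raise
    return fd
```

This sketch does not fsync the parent directory, which a real implementation would likely also need so that the new file's directory entry survives a crash.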
If you're doing this sort of stuff for real under Linux, check out sync_file_range. It's the only non-broken and performant sync API for ext4 (note that it's broken by design for many other file systems, and the API is terribly difficult to use correctly).
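As a hedged illustration of what calling sync_file_range looks like, here is a Linux-only sketch that reaches the libc wrapper through ctypes, since Python's os module does not expose it. The `sync_range` helper is hypothetical; the flag values are the ones from the Linux headers. Per the man page, the call does not write out file metadata, so it offers no crash guarantee unless the range being synced strictly overwrites blocks that were already allocated and written.

```python
import ctypes
import os

# Flag values from <fcntl.h> on Linux.
SYNC_FILE_RANGE_WAIT_BEFORE = 1
SYNC_FILE_RANGE_WRITE = 2
SYNC_FILE_RANGE_WAIT_AFTER = 4

_libc = ctypes.CDLL(None, use_errno=True)
# int sync_file_range(int fd, off64_t offset, off64_t nbytes, unsigned int flags);
_libc.sync_file_range.argtypes = [
    ctypes.c_int, ctypes.c_int64, ctypes.c_int64, ctypes.c_uint,
]
_libc.sync_file_range.restype = ctypes.c_int


def sync_range(fd, offset, nbytes):
    """Start writeback of the dirty pages in [offset, offset + nbytes) and wait for it.

    Hypothetical helper: data only, no metadata, so only meaningful for
    overwrites of blocks that already exist on disk.
    """
    flags = (SYNC_FILE_RANGE_WAIT_BEFORE
             | SYNC_FILE_RANGE_WRITE
             | SYNC_FILE_RANGE_WAIT_AFTER)
    if _libc.sync_file_range(fd, offset, nbytes, flags) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

A WAL writer would call something like `sync_range(fd, block_offset, block_len)` after each write of a block. Whether that is sufficient for durability on a given file system and device is exactly the kind of thing to verify against the man page and with power-loss testing, not to take from this sketch.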
If you really care, it's probably just easier to use SPDK or something. Linux has historically been pretty hostile towards DBMS implementations.