Hacker News new | ask | show | jobs
by eatonphil 1300 days ago
Hey folks! Phil from TigerBeetle here. Happy to answer questions or pull in King when I cannot. :)
4 comments

Is there any attempt to reorder IO events such that writes (and maybe reads) operate in as contiguous of a manner as possible? It might only be relevant in the case of multiple writes to the same file, so the complexity trade-off might not be worth it for TigerBeetle, but might be worth it for other systems.

Edit: also, are you guys hiring? ;)

> Is there any attempt to reorder IO events such that writes (and maybe reads) operate in as contiguous of a manner as possible?

Something like a disk elevator at a higher level of the stack?

Yeah, even tweaking kernel IO scheduling would probably be sufficient. It'll depend on spinning rust vs SSDs though.
Yes, our thinking here was that for SSD or NVMe sometimes the cost of scheduling is not worth it, and "noop" can be a good choice, since the device is already so fast relative to the cost of the CPU and memory access required to schedule.

As far as we understand w.r.t. mechanical sympathy for flash, what counts most is either a large bulk I/O that the device can internally parallelize, or else sufficiently parallel smaller I/Os.

Then, being sector-aligned to avoid the kernel from having to fixup alignment with a memcpy—also, we're using Direct I/O so we have to—and especially trying to issue writes all with similar “Time of Death”. For example, if you're writing out the blocks for a new LSM table on disk, then it is in fact good to keep these close together, since it's likely they'll be compacted again and reclaimed at the same time in future.

This brings us back to scheduling... :)

We are definitely hiring a designer in Europe. :)

I asked DJ (not on HN, but hangs out in our community Slack [3] where you can ask further if curious), who knows the disk side of things best, and he responds:

The OS is free to reorder writes (this is true for both io_uring and conventional IO).

In practice it does this for spinning disks, but not SSDs.

The OS is aware of the "geometry" of a spinning disk, i.e. what sectors are physically close to each other.

But for NVME SSDs it is typically handled in the firmware. SSDs internally remap "logical" addresses (i.e. the address from the OS point of view) to "physical" addresses (actual locations on the SSD).

e.g. if the application (or OS) writes to block address "1" then "2", the SSD does not necessarily store these in adjacent physical locations. (OSTEP explains this well [0].)

"Performance Analysis of NVMe SSDs and their Implication on Real World Databases" explains in more detail:

> In the conventional SATA I/O path, an I/O request arriving at the block layer will first be inserted into a request queue (Elevator). The Elevator would then reorder and combine multiple requests into sequential requests. While reordering was needed in HDDs because of their slow random access characteristics, it became redundant in SSDs where random access latencies are almost the same as sequential. Indeed, the most commonly used Elevator scheduler for SSDs is the noop scheduler (Rice 2013), which implements a simple First-In-First-Out (FIFO) policy without any reordering.

Applications can help performance by grouping writes according to time-of-death (per "The Unwritten Contract of Solid State Drives" [2]), but the SSD is free to do whatever. We are shortly going to be reworking the LSM's compaction scheduling to take advantage of this: https://github.com/tigerbeetledb/tigerbeetle/issues/269.

[0] https://pages.cs.wisc.edu/~remzi/OSTEP/file-ssd.pdf

[1] https://www.cs.binghamton.edu/~tameesh/pubs/systor2015.pdf

[2] https://pages.cs.wisc.edu/~jhe/eurosys17-he.pdf

[3] https://join.slack.com/t/tigerbeetle/shared_invite/zt-1gf3qn...

Write (and read) request reordering happens on Windows with all disk types. In theory it shouldn't be, but in practice we have stats that show that it does. Just fyi.
Is there a reason why you’re not using libdispatch on darwin and instead using kqueue directly?
King from TigerBeetle here.

One of the reasons is that libdispatch's I/O functions introduce extra dynamic allocations for internal queueing via `dispatch_async` ([0],[1],[2]) and from an API perspective of realloc-ing [3] an internally owned [4] buffer.

TigerBeetle, on the other hand, statically allocates all I/O buffers upfront [5], treats these buffers as intrusively-provided typed data [6] (no growing/owned buffers), and does internal queueing without synchronization or dynamic allocation [7].

[0]: https://github.com/apple/swift-corelibs-libdispatch/blob/469...

[1]: https://github.com/apple/swift-corelibs-libdispatch/blob/469...

[2]: https://github.com/apple/swift-corelibs-libdispatch/blob/469...

[3]: https://github.com/apple/swift-corelibs-libdispatch/blob/469...

[4]: https://developer.apple.com/documentation/dispatch/1388933-d...

[5]: https://tigerbeetle.com/blog/a-database-without-dynamic-memo...

[6]: https://github.com/tigerbeetledb/tigerbeetle/blob/d15acc663f...

[7]: https://github.com/tigerbeetledb/tigerbeetle/d15acc663f8882c...

Thank you! I was indeed looking for the technical details!
At the time, we wanted to get macOS or Darwin running as soon as possible to improve the local developer experience, but—we've always had a soft spot for FreeBSD so kqueue was two birds in one!
It'd be super cool to be able to use this as a standalone library — do you have an estimate as to how much work / time it might take to split this out?
Can this benefit BSDs?
I think FreeBSD invented kqueue and from a quick search it looks like OpenBSD and NetBSD also adopted it. We've seen at least one person slightly tweak TigerBeetle to run on FreeBSD already through the darwin code paths.
> We've seen at least one person slightly tweak TigerBeetle to run on FreeBSD already through the darwin code paths.

I'm part of that sample set! Was quite surprised how easy it was to get it up and running on FreeBSD. Benchmarking on tmpfs on both, it even had a ~10% lead on Linux.

(Of course, that's not exactly the intended use case, so don't pay too much attention to that number!)