You missed a number of steps. The transactions are independent so they signal completion (to trigger the commit fsync) independently.
You can have the first transaction wait a bit to see if any other commits can be batched in the same fsync. However that’s off by default as the assumption is you want the transaction to complete as fast as possible.
The clever way to do this is to immediately commit the first transaction when the storage engine is idle.
While it is waiting for the fsync to finish it should batch up any incoming WAL writes and then issue the next fsync immediately after the first one finishes, committing the entire batch at once. Then, and only then, it can reply to clients with “transaction complete”.
Some modern database engines now do this, by many older ones don’t because too much of their code assumes one transaction per fsync.
You still have to wait for the final fsync which is only requested after the transaction work has completed. So not sure you’re gaining much if at all from this.
The key limit is the rate of fsyncs, which is constrained by the user mode to kernel mode transition and physical characteristics of the storage device. In the good old days, it was about a millisecond due to spinning disk latencies, but even on the best SSDs it's about 200 microseconds. This is only about 5K transactions per second, maximum, no matter how trivial the transactions are!
With automatic batching, trivial transactions can be grouped together so that the bottleneck becomes bandwidth, not an absolute rate.
You get to have your cake and eat it too: There's no additional latency added using automatic batching of transactions because when the I/O queue is empty, the next transaction commits immediately, same as normal. If the disk is already in the middle of an fsync, the next one will have to queue up behind it in the storage subsystem anyway, so the DB engine may as well accumulate more transactions in-memory while it is waiting.
Sure, but now when transaction 1 is "committed", it isn't actually guaranteed to be there in the face of interruption. That's a big change to the default behavior...
There was always a gap between "write to disk" and fsync. Now it's a bit longer because we did some other stuff, but that gap was possible before, too.
We still don't tell people it's committed until the fsync returns.
> I'm not sure we're using the same terminology. Committed means the transaction has been hardened to disk. That's the D in ACID.
Yes, the transaction is committed when the transaction is durably written to disk. However, there's not a great API for durably writing to disk, you can write on an FD (or on a mmaped file) and it'll get written eventually hopefully. fsync asks the OS to confirm the writes on an FD are committed durably, but is not without its quirks.
> Otherwise, is the suggestion that there be an artificial delay to allow other transactions to piggyback before returning success on commit 1?
Not really an artificial delay. More that if you have multiple transactions waiting to be comitted, you shouldn't commit them to disk one at a time.
Instead, write several to disk, then fsync, then send commit notices.
A responsible database engine writes transaction data to an FD, then does an fsync, then signals completion to the client; then moves onto the next transaction right?
The suggestion is because fsync is rate limited and blocks further writes while it's pending, you can get better throughput by writing several transactions before calling fsync. The database engine still doesn't signal completion until an fsync after a transaction is written, but you have more data written per fsync. There is a latency penalty for the first transaction in the batch, because you must wait for writes for the whole batch to become durable, but because you're increasing throughput, average latency likely decreases.
Really, there's a fundamental mismatch between the capabilities of the system, the requirements of the database engine, and the interface between them. Synchronous fsync meets the requirements, but an asynchronous fsync would be better for throughput. Then the database engine could write transaction 1, call for fsync 1, write transaction 2, call for fsync 2, etc and once the responses came in, signal commits to the relevant clients. Having more requests in pipeline is key to throughput in a communicating system.
You can have the first transaction wait a bit to see if any other commits can be batched in the same fsync. However that’s off by default as the assumption is you want the transaction to complete as fast as possible.
At least that’s how PostgreSQL implements it.