Besides storage itself, the Postgres compute layer has a good amount of (transient) state that doesn't lend itself to either compute nodes or clients springing in and out of existence in a serverless environment. For instance, a fresh compute node with an unfilled cache can perform horribly, and Postgres client connections don't scale well with transient clients. Both of these problems, and others in the same category, were very noticeable for Aurora Serverless. My understanding is that AWS mitigates these two with an elaborate cache-filling service for new nodes, and a pgbouncer-style proxy that pools connections and hides compute-node rescheduling from clients.
What's Neon's point of view about transient state in nodes? Is there a world where serverless client connections are stateless, or is the setup overhead not expected to be worth the cost?
Right. Our design guideline is to get as much serverless behavior as possible while keeping full Postgres compatibility (in terms of features and expected performance). Single-node Postgres can give you hundreds of thousands of small RW queries per second, so competing connections should be a few compare-and-swap instructions away from the shared state to provide this performance. So for the primary, it means it should be just Postgres in a container or VM, and we have to deal with the consequences (cache pre-warming, cross-node migrations, etc).
However, read-only nodes require less coordination, and we have way more freedom there, so read-only Postgres as a function seems to be a more feasible concept.
This is a very good question. We are working on it and will be publishing a blog post on autoscaling very soon. We are experimenting with VM migration technology that would allow us to transfer the state between compute nodes and fail over traffic.
We have some encouraging early results, but haven't committed to a particular technology (like Cloud Hypervisor) yet.
Thank you for publishing the TLA+ model, that dramatically increases my level of trust in Neon. (I'm on the early adopter list, got my invite a week or two ago but haven't been able to give it a spin yet.)
> Right now, such a change requires humans to be in the loop to ensure that the old safekeeper is actually down. It is on our roadmap to automate this procedure.
If you do implement this (which I don't recommend), be certain to also model it with TLA+. This level of automation, IMO, requires a human in the loop + a ton of visibility tracking on when it is happening.
A good way to roll it out is "semi-automation": implement the automation, but have it ask a human to approve. The human will then do the normal (manual) verification. After you've run that successfully for a year, and your TLA+ model passes, you can then decide to fully automate without a human in the loop.
Otherwise, you're asking for an outage (caused by bad failover) IMO, and possibly data loss.
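To make the "semi-automation" idea concrete, here is a minimal sketch of an approval gate, not anything Neon has published. Every name in it (`ReplacementProposal`, `SemiAutomatedFailover`, `approve`, `apply_change`) is hypothetical: the automation only builds a proposal and records it, and the membership change runs only if a human-backed callback approves.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ReplacementProposal:
    """What the automation wants to do, plus the evidence it collected."""
    old_safekeeper: str
    new_safekeeper: str
    evidence: List[str]


@dataclass
class SemiAutomatedFailover:
    # Hypothetical hook: in practice this would page an operator and block
    # until they approve or reject after doing the usual manual checks.
    approve: Callable[[ReplacementProposal], bool]
    audit_log: List[str] = field(default_factory=list)

    def handle_suspected_failure(self, old, new, evidence, apply_change):
        """Propose a replacement; apply it only after human approval."""
        proposal = ReplacementProposal(old, new, list(evidence))
        self.audit_log.append(f"proposed: replace {old} with {new}")
        if not self.approve(proposal):
            self.audit_log.append(f"rejected: {old} left in place")
            return False
        apply_change(proposal)  # the only place the change is actually made
        self.audit_log.append(f"applied: {old} replaced by {new}")
        return True
```

The audit log is the "visibility tracking" part: every proposal, rejection, and applied change leaves a record, which is also the data you would review before deciding to drop the human step.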
Do you attempt to guarantee linearizability of read-only operations? The scenario I'm concerned about is when a partitioned compute node is processing a read-only transaction from a partitioned client, and neither has noticed the partitioned compute node has been replaced in a later term. Do you use a lease system for this that relies on the partitioned compute nodes being able to accurately measure the passage of time (not wall clock time), or do you have the compute nodes contact a quorum of acceptors before replying to read-only queries as well?
Good catch! Currently, we don't, and we rely on k8s to stop the old node. Technically speaking, if k8s and our control plane are always good at stopping the old primary, we don't need consensus at all. So it is more a question of what set of problems we can see if there is a bug in our orchestration code. Split-brain seemed to be unacceptable. But for stale reads, we decided that we can rely on k8s alone, without double-checking on our side.
Postgres tracks maximal LSN among the evicted pages and passes it to the pageserver in the page request. If the pageserver hasn't received that LSN, it will wait for it to arrive.
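A toy sketch of that handshake, to show the waiting behavior rather than Neon's actual code. The class and method names (`PageServer`, `Compute`, `receive_wal`, `get_page`, `evict`) are made up for illustration, and it assumes WAL records reach the pageserver in LSN order.

```python
import threading


class PageServer:
    """Toy pageserver: stores page versions and tracks the last LSN received."""

    def __init__(self):
        self._cond = threading.Condition()
        self._last_received_lsn = 0
        self._pages = {}  # page_id -> list of (lsn, contents), in LSN order

    def receive_wal(self, lsn, page_id, contents):
        """Ingest one WAL record and wake up any readers waiting on this LSN."""
        with self._cond:
            self._pages.setdefault(page_id, []).append((lsn, contents))
            self._last_received_lsn = max(self._last_received_lsn, lsn)
            self._cond.notify_all()

    def get_page(self, page_id, request_lsn, timeout=None):
        """Block until WAL up to request_lsn has arrived, then serve the page."""
        with self._cond:
            caught_up = self._cond.wait_for(
                lambda: self._last_received_lsn >= request_lsn, timeout=timeout
            )
            if not caught_up:
                raise TimeoutError(f"LSN {request_lsn} has not arrived yet")
            versions = [c for (lsn, c) in self._pages.get(page_id, []) if lsn <= request_lsn]
            return versions[-1] if versions else None


class Compute:
    """Toy compute node: remembers the maximal LSN among evicted pages."""

    def __init__(self, pageserver):
        self.pageserver = pageserver
        self.max_evicted_lsn = 0

    def evict(self, page_lsn):
        self.max_evicted_lsn = max(self.max_evicted_lsn, page_lsn)

    def read_page(self, page_id, timeout=None):
        # The request carries the LSN so the pageserver cannot answer with
        # a version older than what this node has already written out.
        return self.pageserver.get_page(page_id, self.max_evicted_lsn, timeout)
```

The point of the wait is that the compute node never gets back a page older than one it has already evicted, even if the pageserver is lagging behind the WAL stream.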