| Hi, I'm the author of GoshawkDB. Thanks for your questions. They're certainly not answered on the website, so I'll do my best to answer them here and incorporate them into the website later.

> Curious how this works at scale. For example if a node is down and N requests start blocking, is there an upper bound to N? What happens when this is reached? Does N map to a goroutine? A file descriptor? Is there timeouts?

The blocking actually occurs on a per-client basis: when a client submits a txn to the server to which it's connected, that server calculates which other server nodes need to be contacted in order for the txn to be voted on. If it cannot find enough servers due to failures, it will outright reject the txn and tell the client that there are currently too many failures going on.

However, the server could find the right set of server nodes and send off the txn to be voted on, verified and validated. Right at that point there could be further failures, so it's not known how far those messages got. It is in this case that the txn could block, because until some nodes recover it's impossible to safely abort the txn.

But it's important to note that:

1. If a txn starts blocking, that'll block just that txn and that client (as the client is blocked waiting on the outcome of the txn). No other part of the server or any other server is affected.
2. Other clients can be submitting other txns at the same time that need different un-failed nodes to proceed, and they progress just fine.
3. There is no global determination (yet) of whether a node is failed or not. Thus this approach can cope with really weird network issues, e.g. random connections between different nodes failing. There is no need for the whole remaining cluster to agree that nodes X, Y and Z are "out"; none of the algorithms depend on that sort of thing.

I hope this answers your questions. |
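The first decision described above (reject the txn outright when too few nodes are reachable) could look roughly like the Go sketch below. This is only an illustration, not GoshawkDB's actual code: `pickVoters`, `ErrTooManyFailures`, and the "take the first reachable nodes" rule are all hypothetical names and simplifications introduced here.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrTooManyFailures is a hypothetical error telling the client that the
// txn was rejected before being sent anywhere.
var ErrTooManyFailures = errors.New("too many failures: not enough reachable nodes to vote on txn")

// pickVoters returns `needed` reachable nodes from candidates, in order.
// If fewer than `needed` are reachable, the txn is rejected immediately
// rather than left to block. The selection rule is an assumption.
func pickVoters(candidates []string, reachable map[string]bool, needed int) ([]string, error) {
	voters := make([]string, 0, needed)
	for _, node := range candidates {
		if len(voters) == needed {
			break
		}
		if reachable[node] {
			voters = append(voters, node)
		}
	}
	if len(voters) < needed {
		return nil, ErrTooManyFailures
	}
	return voters, nil
}

func main() {
	candidates := []string{"a", "b", "c"}

	// Node "b" is down, but two voters are still available.
	voters, err := pickVoters(candidates, map[string]bool{"a": true, "c": true}, 2)
	fmt.Println(voters, err)

	// Only one node is reachable: reject straight away.
	voters, err = pickVoters(candidates, map[string]bool{"a": true}, 2)
	fmt.Println(voters, err)
}
```

Note that this only covers the up-front rejection. The blocking case arises later, when nodes fail after the txn has already been sent out, and a check like this cannot prevent it.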
With your situation #1, what if this is a very common transaction and you therefore have 100 of these all waiting? What about 1000, 5000, etc.? What system resources are used to let these transactions wait indefinitely (if I understand your semantics with specific regard to blocking)?
Some systems handle this as a failure that is communicated to the client rapidly. Other systems let N clients actually wait indefinitely, but at the cost of each one taking up a thread, a file descriptor, etc. In systems that have a finite number of threads, for example, this would then show up as a sort of upper bound on how many requests one could have waiting.
So I'm just trying to get a feeling for how this could have an unbounded number of waiting transactions due to partial failure and still keep taking requests.
Thanks for the reply; this stuff is always interesting.