Hacker News new | ask | show | jobs
by sdab 3777 days ago
"When used as a failure detector, timeouts are just a guess that something is wrong. (If they could, distributed algorithms would do without clocks entirely, but then consensus becomes impossible [10]."

Having just re-read Lynch's paper, can you explain what you mean here? I didn't see anything explicitly relying on time. It could be there is some implicit usage I didnt see. Additionally, the paper's impossibility result is about "perfectly correct consensus" which applies with and without clocks and then has a positive result for "partially correct consensus" (i.e. not deciding a value is a correct result). Im not sure which you mean when you say "consensus becomes impossible" as it is either already impossible (the perfectly correct protocol) with one faulty process or (to my understanding) not dependent on time (the partially correct protocol).

p.s. great article!

3 comments

The FLP result (Fischer, Lynch, Paterson) shows that consensus cannot reliably be solved in an asynchronous system (if you do not make any timing assumptions) if one or more processes can fail. The impossibility result shows that any consensus algorithm in such a system will have executions in which it waits forever and never makes a decision, which breaks the liveness property of consensus.

However, if you do allow some timing assumptions — just enough to measure a timeout, nothing more — then consensus becomes possible. In this case, the timeout is known as an "unreliable failure detector". See Chandra and Toueg's paper for details.

In a good algorithm, the safety properties do not depend on timing, only the liveness properties do. Rather than "consensus being impossible" perhaps it would be clearer to say "consensus may never terminate" in an asynchronous system. But "consensus impossible" is the standard way of phrasing this issue in the distributed systems literature.

I'm having trouble relating this comment to the referenced paper [1], which seems to be explicit that its impossiblity result does not apply when synchronised clocks are available.

For example, on page 375: “Crucial to our proof is that processing is completely asynchronous; that is, we make no assumptions about the relative speeds of processes or about the delay time in delivering a message. We also assume that processes do not have access to synchronized clocks, so algorithms based on time-outs, for example, cannot be used.”

1. http://www.cs.princeton.edu/courses/archive/fall07/cos518/pa...

You mean the OP's comment or mine? I agree with you if you mean the OP's comment. Though your quote refers to synchronized clocks. I dont think OP is referring to synchronized clocks, though your point that Lynch et al does not rely on timeouts is what Im getting at.
"When used as a failure detector, timeouts are just a guess that something is wrong."

I send a packet over the network. I configured the timeout to be 10 seconds. 10 seconds passes. Is the packet truly lost? Or, maybe the packet was received, processed, and and a response is on the way back, but it takes 20 seconds instead of 10! Or maybe it takes an hour instead of 20 seconds. Or a decade.

We don't know if the packet is truly lost or is just taking >$timeout_timer and making assumptions is dangerous.

My question was about his claim in regards to the result in the paper. In practice, you are right but a consensus algorithm is still "partially correct" if it never decides on a value. It is only incorrect if different nodes decide on different values. For example, paxos does is not guaranteed to decide on a value. So timeouts are useful guessing tools, but I dont see how there is no consensus possible without them.