Hacker News new | ask | show | jobs
by benschulz 1680 days ago
Quoting from the paper's description of Phase 1:

> (b) If an acceptor receives a prepare request with number n greater than that of any prepare request to which it has already responded, then it responds to the request with a promise not to accept any more proposals numbered less than n and with the highest-numbered proposal (if any) that it has accepted.

It says an acceptor must respond with the highest-numbered proposal (if any) that it has accepted.

How is acceptor C going to do that after step 9? That's where the bug is introduced, I think, not anywhere in the paper.

2 comments

mjb (author) links to an SO post which goes:

> If you look at _Lamport's paxos proofs_ he treats an accept as a promise... But this is not pointed out in _Paxos Made Simple_. In fact, it appears Lamport took great pains to specify that an accept was not a promise.

> The problem is when you combine the weaker portions of both variants; as the OP did and several implementations do. Then you run into this catastrophic bug.

That is, Lamport's proofs and Paxos Made Simple contradict each other ever so slightly to trip everyone all at the same time:

> What's more I looked through several proprietary and open-source paxos implementations and they all had the bug submitted by the OP!

https://stackoverflow.com/questions/29880949/contradiction-i...

Lamport himself acknowledges as much: https://www.youtube.com/watch?t=4398&v=8-Bc5Lqgx_c

Lamport acknowledges the report, not the error. I'm still not convinced there even is an error.

The state after step 9 should be the same as after step 7, i.e. `A(-:-,100) B(100:b,100) C(100:b,-)` because C needs to retain the "highest-numbered proposal" it accepted, not the one it accepted last. That means the nine steps outlined in the post/on StackOverflow do not, by themselves, demonstrate any problem.

So what additional steps are missing/what alternative steps actually produce an inconsistency/divergence?

There isn't an "error" / "bug" in either papers when one looks at them in isolation. Together, they seem to have a very subtle difference in the algorithm to trip most distributed systems practioners.
It's a misunderstanding of the paper I think. The blog post author and SO post both make the mistake of altering the Paxos Made Simple algorithm to allow sending phase 2 messages to any majority set of acceptors. But that doesn't work unless you also alter the P2b wording so that an acceptor also treats prior accepts as promises. I can see there is a slight ambiguity in the reading of the PMS paper itself that would make you think this is okay, but saying it is a bug in the paper is a bit of stretch.