| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by erentz 3784 days ago

"If you still don’t believe me about process pauses, then consider instead that the file-writing request may get delayed in the network before reaching the storage service. Packet networks such as Ethernet and IP may delay packets arbitrarily, and they do [7]: in a famous incident at GitHub, packets were delayed in the network for approximately 90 seconds [8]. This means that an application process may send a write request, and it may reach the storage server a minute later when the lease has already expired."

This statement confused me. It seems to say that the packets were delayed in the network for 90 seconds before being delivered. From reading the original sources it actually sounds like packets were discarded by the switches, so the original requests discarded, and the nodes were partitioned for 90 seconds. When the partition was removed both nodes thought they were the leader and simultaneously requested the other to shutdown. Can anyone confirm? Keeping packets delayed in a network for 90 seconds would seem quite difficult (though not impossible assuming certain bugs).

Edit: On re-reading I think this is just talking about the network stack in general - not the network. A temporary partition may delay delivery of your request until max TCP retries is exceeded on your host, if its recovered before then your request may arrive later than you intended.

1 comments

martinkl 3784 days ago

I only have the blog post to go by, and don't have first-hand information. It seems possible to me that a few packets could indeed be delayed by 90 seconds, perhaps stuck in a switch buffer somewhere, although this would be a small number of packets since those buffers are not very big.

However, yes, I was thinking about the network stack as a whole. Any kind of retries, e.g. TCP retransmission on timeout, effectively turns packet loss into packet delay (within certain bounds). Thus, even if you have a network interruption ("partition") and all packets are dropped, it could happen that after the interruption is fixed, a node receives packets that were sent before or during the interruption. For this reason, I find it helpful to think of it as delay, not just packet loss.