Also, in my experience at least, it's not necessarily that routers drop TCP keep-alives, but rather that the default keep-alive interval on most OSes is way longer than the router's timeout for idle entries in its NAT table.
I was burned hard by this in Azure. It seems that the default idle timeout is around 4 minutes for the TCP load balancers. You can bump it to 30 min, but if I recall correctly the default keep-alive interval on Linux is 2 hours, so the first keep-alive probe would go out long after the LB had already forgotten the connection. Any long-standing idle TCP connection would get into a state where both sides believed they were connected, but the packets would get dropped on the floor. When the LB timed the connection out, it didn't emit any FIN or RST packets, so neither side knew it had been torn down.
Fun debugging on that one. During the day there was enough activity to keep the connections alive, but at night they'd break. The overall behaviour was that the service worked great all day, but the first few actions outside business hours would fail due to application-layer timeouts, and then everything would work great again until it had sat idle for a while.
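For what it's worth, one way around this is to turn keep-alive on per socket and pull the idle time down below the LB timeout. Here's a rough sketch in Python, assuming Linux (where the TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT options exist). The enable_keepalive helper and the 120 s / 30 s / 4 values are my own picks for illustration, chosen only to sit comfortably under a roughly 4 minute timeout. It's not what we actually deployed.

```python
import socket


def enable_keepalive(sock, idle_s=120, interval_s=30, probes=4):
    """Enable TCP keep-alive with an idle time below the LB's idle timeout.

    The defaults are illustrative assumptions, not recommended values.
    """
    # Keep-alive is off by default; this switches it on for this socket only.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # The tuning options below are platform-specific (present on Linux),
    # so each one is guarded rather than assumed.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idleness before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between probes once probing has started.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes before the kernel declares the connection dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


if __name__ == "__main__":
    s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    print("SO_KEEPALIVE enabled:", bool(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)))
    s.close()
```

With settings like these the probes keep the LB's entry fresh while the link is healthy, and if the path really is dead the kernel gives up after a few unanswered probes instead of leaving both ends believing they're connected. An application-level heartbeat gets you the same effect if you can't touch the socket options.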