Hacker News
by InclinedPlane 5226 days ago
On a hunch I converted 497 days to seconds, and it works out to about 42.9 million. A suspiciously familiar number, as 2^32 hundredths of a second is almost exactly that (42.95 million seconds). Since 10 ms is a common clock resolution, that points to an obvious cause: a 32-bit tick counter rolling over and horfing the relative age calculations, so all of the sockets that were open prior to the rollover stay open forever.
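For anyone who wants to check the arithmetic, here is a quick sketch. The function name `rollover_days` is my own, and the only assumption is an unsigned counter that increments a fixed number of times per second:

```python
SECONDS_PER_DAY = 86_400

def rollover_days(bits: int, ticks_per_second: int) -> float:
    """Days until an unsigned `bits`-bit tick counter wraps back to zero."""
    return (2 ** bits) / ticks_per_second / SECONDS_PER_DAY

# 32-bit counter ticking every 10 ms (100 ticks per second)
print(f"{rollover_days(32, 100):.1f} days")   # 497.1 days

# Same counter ticking every 1 ms (1000 ticks per second)
print(f"{rollover_days(32, 1000):.1f} days")  # 49.7 days
```

A 10x coarser tick buys exactly 10x more uptime before the wrap, nothing more.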

Windows 95 crashed after 49.7 days (2^32 ms) for similar reasons: http://news.cnet.com/2100-1040-222391.html
There are two things which are a bit off-putting about this.

First, the same exact type of bug was already known in 1999, and yet they either failed to fix it in the newer code base or reimplemented the exact same bug in new code.

Second, almost certainly the reason these bugs weren't caught earlier is that it's unusual for Windows to have such long uptime (50 days for Win 9x is impressive, and over a year for Windows Server equally so). Moreover, the average user almost certainly has such low expectations of Windows reliability that if they see the system become unstable or slow after a long period of uptime they will, as a rule, merely reboot the system rather than investigate.

Edit: a thought occurs to me. Perhaps the "fix" for the older problem was simply to change from using milliseconds since last boot for TCP/IP socket age to using hundredths of a second. I really, really hope that wasn't the case.

This is very much a Windows issue, however: other operating systems have higher-resolution TCP timestamps (e.g. 1 ms on Linux, which rolls over every 49.7 days), and yet they do not have issues with closing sockets.
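To illustrate why a rollover does not have to break anything, here is a rough sketch of the usual wrap-safe trick: do the subtraction modulo 2^32. This is not taken from any actual Windows or Linux source. The names (`naive_age`, `wrap_safe_age`, `TIMEOUT_TICKS`) and the tick values are made up for the example, and the naive version is only a guess at how such a bug could play out.

```python
MASK32 = 0xFFFFFFFF          # keep results within 32 bits, like a C uint32_t
TIMEOUT_TICKS = 300          # hypothetical timeout, in ticks

def naive_age(now: int, opened: int) -> int:
    """Plain signed subtraction: goes negative once `now` has wrapped."""
    return now - opened

def wrap_safe_age(now: int, opened: int) -> int:
    """Subtraction modulo 2**32: correct as long as the true age is
    less than one full wrap period."""
    return (now - opened) & MASK32

opened = 0xFFFFFF00          # socket opened 256 ticks before the wrap
now = 0x00000100             # checked 256 ticks after the wrap

print(naive_age(now, opened))       # -4294966784
print(wrap_safe_age(now, opened))   # 512

# With the naive age the socket never looks old enough to time out.
print(naive_age(now, opened) > TIMEOUT_TICKS)      # False
print(wrap_safe_age(now, opened) > TIMEOUT_TICKS)  # True
```

The modular version still gives wrong answers for anything older than one full wrap period, so it only works when the intervals being measured are much shorter than the rollover time.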