by justinfrankel 78 days ago
ah reading their analysis, there are errors that explain this. Particularly this:

  tcp_now   = 4,294,960,000  (frozen at pre-overflow value)
  timer     = 4,294,960,000 + 30,000 = 4,294,990,000
              (exceeds uint32 max → wraps to a small number)
timer wraps to a small number, they say

  TSTMP_GEQ(4294960000, 4294990000)
they forgot to wrap it there, it should be TSTMP_GEQ(4294960000, small_number)

  = (int)(4294960000 - 4294990000)
  = (int)(-30000)
  = -30000 >= 0 ?  → false!
wrong!

There may be a short time period where this bug occurs, and if you get enough TCP connections into TIME_WAIT in that period, they could stick around, maybe. But I think the original post is completely overreacting and was probably written by an LLM, lol.

There does appear to be a bug, but it's not what the blog describes.

If tcp_now stops updating at <= 2^32 - 30000 milliseconds, then TSTMP_GEQ(tcp_now, timer) will always fail since timer is tcp_now + 30000 which won't wrap.

This does look like it is possible since calculate_tcp_clock() which updates tcp_now only runs when there's TCP traffic. So if at 49 days uptime you halted all TCP traffic and waited about a day, tcp_now would be stuck at the value before you halted TCP traffic.

In cases where tcp_now gets stuck at > 2^32 - 30000, it looks like TCP sockets in the TIME_WAIT will end up being closed immediately instead of waiting 30 seconds, which isn't great either.

Are you sure?

tcp_now can never actually reach 2^32, because that value needs 33 bits and does not fit in the 32-bit data type.

Therefore, tcp_now + 30000 will wrap when tcp_now is equal to 2^32 - 30000. Your inequality sign should be strict <, otherwise the result does not follow.

Yes, you are correct. Bad editing on my part.

It should be that if tcp_now gets stuck before (<) (2^32 - 30000) ms from boot, the deadline timers for reaping TIME_WAIT sockets would always be greater than tcp_now, because the deadline wouldn't wrap. If stuck at or after (>=) (2^32 - 30000), they could potentially be reaped faster than they should be.
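To make the two cases concrete, here is a rough sketch, not the kernel's code. The helper names, the 30000 ms constant, and the plain unsigned comparison are assumptions for illustration; only the modular form mirrors the TSTMP_GEQ expression quoted at the top of the thread.

```c
#include <stdint.h>

#define TIME_WAIT_MS 30000u  /* assumed 30 s timeout, in milliseconds */

/* Deadline computed in 32 bits, so it wraps once it passes 2^32 - 1. */
static uint32_t deadline(uint32_t frozen_now)
{
    return (uint32_t)(frozen_now + TIME_WAIT_MS);
}

/* Hypothetical plain unsigned comparison: "has now reached the deadline?" */
static int expired_plain(uint32_t frozen_now)
{
    return frozen_now >= deadline(frozen_now);
}

/* Modular comparison in the style of the quoted TSTMP_GEQ:
   subtract in unsigned arithmetic, then look at the sign. */
static int expired_modular(uint32_t frozen_now)
{
    return (int32_t)(uint32_t)(frozen_now - deadline(frozen_now)) >= 0;
}
```

With the plain comparison, a clock frozen below 2^32 - 30000 never reaches its deadline, and one frozen at or above it is past the wrapped deadline immediately. With the modular comparison, a frozen clock never reaches the deadline on either side of the boundary, so which behavior you get depends on which comparison the real code uses.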

Actually, looking at the code a bit more, it looks like calculate_tcp_clock() is run at least once per hour even when there's no TCP traffic or sockets open, so it would be hard to predict whether the system ever gets into the state where it never reaps TIME_WAIT sockets.

It also looks like if tcp_now gets stuck, other TCP timers may have problems as well.

yep that makes sense
They didn’t need to wrap it, because it’s modular arithmetic: the result after casting to int is the same regardless of wrapping behavior. 4294990000 after wrapping is 22704, and 4294960000 - 22704 = 4294937296, which is -30000 after the uint-to-int cast.
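A small sketch of that arithmetic in C, in case anyone wants to check it. The function names here are made up for illustration; the expression is modeled on the TSTMP_GEQ expansion quoted above, and the cast to a signed type assumes the usual two's complement behavior.

```c
#include <stdint.h>

/* Signed distance a - b, computed modulo 2^32.
   The unsigned subtraction wraps; the cast reinterprets the result
   as a signed 32-bit value (two's complement on mainstream compilers). */
static int32_t tstmp_diff(uint32_t a, uint32_t b)
{
    return (int32_t)(uint32_t)(a - b);
}

/* "a is at or after b" in modular time, like the quoted TSTMP_GEQ. */
static int tstmp_geq(uint32_t a, uint32_t b)
{
    return tstmp_diff(a, b) >= 0;
}
```

Calling tstmp_diff(4294960000u, 22704u) gives -30000, the same value the blog got by subtracting the unwrapped 4294990000, so the comparison comes out false either way.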