by justinfrankel 70 days ago
have multiple macOS machines with 600-1000+ day uptimes, which do TCP connections every minute or so at a minimum, they are still expiring their TIME_WAIT connections as normal.

these kernel versions:

Darwin Kernel Version 20.6.0: Thu Jul 6 22:12:47 PDT 2023; root:xnu-7195.141.49.702.12~1/RELEASE_ARM64_T8101 arm64

Darwin Kernel Version 17.7.0: Wed Apr 24 21:17:24 PDT 2019; root:xnu-4570.71.45~1/RELEASE_X86_64 x86_64

so... wonder what that's about?


ah reading their analysis, there are errors that explain this. Particularly this:

  tcp_now   = 4,294,960,000  (frozen at pre-overflow value)
  timer     = 4,294,960,000 + 30,000 = 4,294,990,000
              (exceeds uint32 max → wraps to a small number)
timer wraps to a small number, they say

  TSTMP_GEQ(4294960000, 4294990000)
they forgot to wrap it there, it should be TSTMP_GEQ(4294960000, small_number)

  = (int)(4294960000 - 4294990000)
  = (int)(-30000)
  = -30000 >= 0 ?  → false!
wrong!

There may be a short time period where this bug occurs, and if you get enough TCP connections into TIME_WAIT in that period, they could stick around, maybe. But I think the original post is completely overreacting and was probably written by an LLM, lol.

There does appear to be a bug, but it's not what the blog describes.

If tcp_now stops updating at <= 2^32 - 30000 milliseconds, then TSTMP_GEQ(tcp_now, timer) will always fail since timer is tcp_now + 30000 which won't wrap.

This does look like it is possible since calculate_tcp_clock() which updates tcp_now only runs when there's TCP traffic. So if at 49 days uptime you halted all TCP traffic and waited about a day, tcp_now would be stuck at the value before you halted TCP traffic.
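To make the stuck-clock scenario concrete, here is a minimal sketch in C. It assumes the check is exactly the signed-difference comparison quoted upthread and that the TIME_WAIT delay is 30000 ms; `tstmp_geq`, `time_wait_expired`, `reaps_with_frozen_clock`, and `self_check` are hypothetical names for illustration, not XNU code.

```c
#include <assert.h>
#include <stdint.h>

#define TIME_WAIT_MS 30000u /* the 30 s delay discussed in the thread */

/* Assumed shape of the comparison: signed view of a 32-bit difference.
 * The unsigned-to-signed conversion is implementation-defined in ISO C,
 * though mainstream compilers wrap in two's complement. */
static int tstmp_geq(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) >= 0;
}

/* 1 if an entry armed at `armed_at` counts as expired when the clock reads `now`. */
static int time_wait_expired(uint32_t armed_at, uint32_t now)
{
    uint32_t deadline = armed_at + TIME_WAIT_MS; /* wraps modulo 2^32 */
    return tstmp_geq(now, deadline);
}

/* Polls the expiry check `polls` times while the clock never advances,
 * and returns how many of those polls reported "expired". */
static unsigned reaps_with_frozen_clock(uint32_t frozen_now, unsigned polls)
{
    unsigned reaped = 0;
    for (unsigned i = 0; i < polls; i++) {
        reaped += (unsigned)time_wait_expired(frozen_now, frozen_now);
    }
    return reaped;
}

/* Worked cases: a frozen clock never reaps, an advancing clock does. */
static void self_check(void)
{
    assert(reaps_with_frozen_clock(4294000000u, 1000u) == 0);
    assert(time_wait_expired(4294000000u, 4294029999u) == 0); /* 1 ms early */
    assert(time_wait_expired(4294000000u, 4294030000u) == 1); /* on time */
}
```

Calling `self_check()` from a `main` exercises both cases: with the clock frozen the deadline always sits 30000 ms ahead of it, and the entry only expires once the clock moves forward by the full delay.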

In cases where tcp_now gets stuck at > 2^32 - 30000, it looks like TCP sockets in TIME_WAIT will end up being closed immediately instead of waiting 30 seconds, which isn't great either.

Are you sure?

tcp_now can never actually reach 2^32: a 32-bit unsigned value tops out at 2^32 - 1, because 2^32 itself needs 33 bits.

Therefore, tcp_now + 30000 will wrap when tcp_now is equal to 2^32 - 30000. Your inequality sign should be strict <, otherwise the result does not follow.

Yes, you are correct. Bad editing on my part.

It should be that if tcp_now gets stuck before (<) (2^32 - 30000) ms from boot, the deadline timers for reaping TIME_WAIT sockets would always be greater than tcp_now, because they wouldn't wrap. If it gets stuck at or after (>=) (2^32 - 30000), the sockets could potentially be reaped faster than they should be.

Actually, looking at the code a bit more, it looks like calculate_tcp_clock() is run at least once per hour even when there's no TCP traffic or sockets open, so it would be hard to predict whether a system ever gets into the state where it never reaps TIME_WAIT sockets.

It also looks like if tcp_now gets stuck, other tcp timers may have problems as well.

yep that makes sense
They didn’t need to wrap it because it’s modular arithmetic so the result after casting to int is the same regardless of wrapping behavior. 4294990000 after wrapping is 22704 and 4294960000 - 22704 = 4294937296 which is -30000 after uint to int cast
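The arithmetic above can be checked with a short C sketch. It assumes the comparison is the signed view of an unsigned 32-bit difference, as quoted earlier in the thread; the function names are hypothetical and nothing here is copied from XNU.

```c
#include <assert.h>
#include <stdint.h>

/* The deadline as a uint32_t would actually hold it: wrapped modulo 2^32.
 * Unsigned overflow is well-defined in C. */
static uint32_t wrapped_deadline(uint32_t now, uint32_t delay_ms)
{
    return now + delay_ms;
}

/* Signed distance from the deadline to now, in ms. The conversion to
 * int32_t is implementation-defined in ISO C, though mainstream
 * compilers wrap in two's complement. */
static int32_t signed_delta(uint32_t now, uint32_t deadline)
{
    return (int32_t)(now - deadline);
}

/* Assumed shape of the comparison under discussion. */
static int tstmp_geq(uint32_t a, uint32_t b)
{
    return signed_delta(a, b) >= 0;
}

/* Reproduces the numbers from the comment above. */
static void self_check(void)
{
    uint32_t now = 4294960000u;
    uint32_t deadline = wrapped_deadline(now, 30000u);

    assert(deadline == 22704u);                   /* 4294990000 mod 2^32 */
    assert(now - deadline == 4294937296u);        /* unsigned difference */
    assert(signed_delta(now, deadline) == -30000);
    assert(tstmp_geq(now, deadline) == 0);        /* not yet expired */
}
```

Calling `self_check()` from a `main` confirms that the wrapped deadline gives the same -30000 the unwrapped subtraction would, which is why the comparison comes out the same either way.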
The bug was introduced only last year in macOS 26:

https://github.com/apple-oss-distributions/xnu/blame/f6217f8...

> Apple Community #250867747: macOS Catalina — "New TCP connections can not establish." New connections enter SYN_SENT then immediately close. Existing connections unaffected. Only a reboot fixes it.

This is a weird thing to cite if it's a macOS 26 bug. I quite regularly go over 50 days of uptime without issues so it makes sense for it to be a new bug, and maybe they had different bugs in the past with similar symptoms.

Interesting. The article mentions complaints on the forums running Catalina, so that must be something else.
As someone who also operates fleets of Macs, for years now, there is no possible way this bug predates macOS 26. If the bug description is correct, it must be a new one.
The article is written using AI, so unless you verified the complaints, the safe default assumption is that they don't exist.
It definitely exists, but it could be a completely unrelated issue.

https://discussions.apple.com/thread/250867747