by mozdeco 1591 days ago
> the fix has to be in the code that communicates back, it should fail gracefully.

The bug that caused the hang was in the network stack itself. There was no way the calling code could have prevented this in any way. You can see this by taking a look at the linked HTTP/3 code. The hang was not caused by higher-level code retrying over and over; that was not the problem here.

Under "Lessons learned" you can also read "investigating action points both to make the browser more resilient towards such problems". I agree that this is broadly worded, but it covers ideas that would have made this technically recoverable (e.g. can network requests be compartmentalized so that they do not all block on a single network thread?).
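
To make the compartmentalization idea concrete, here is a minimal sketch in Python. It is not how Firefox works, and every name in it (`run_isolated`, `hanging_request`, `healthy_request`) is made up for illustration. Each request runs on its own worker thread with a deadline, so a request that never returns costs the caller one timeout instead of blocking everything queued behind it.

```python
import queue
import threading


def run_isolated(task, timeout_s):
    """Run task() on its own daemon thread and give up after timeout_s seconds.

    Hypothetical helper: the stuck worker is not repaired or killed,
    the caller simply stops waiting for it.
    """
    result = queue.Queue(maxsize=1)

    def worker():
        result.put(task())

    threading.Thread(target=worker, daemon=True).start()
    try:
        return ("ok", result.get(timeout=timeout_s))
    except queue.Empty:
        return ("timed_out", None)


# Event used only so the simulated hang can be released at the end.
stop = threading.Event()


def hanging_request():
    # Stand-in for a bug inside the network stack: never returns on its own.
    stop.wait()
    return "never"


def healthy_request():
    return "200 OK"


outcomes = [
    run_isolated(hanging_request, 0.2),
    run_isolated(healthy_request, 0.2),
]
stop.set()  # let the simulated hung worker exit
print(outcomes)
```

The second request still completes even though the first one never does. In a real browser the hard part is what this sketch skips: the stuck worker still holds its resources, and something has to decide when to tear it down and restart it.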


> There was no way the calling code could have prevented this in any way.

It could have prevented it by not making the call in the first place.

As explained in the article, this problem was not specific to Telemetry:

“This is why users who disabled Telemetry would see this problem resolved even though the problem is not related to Telemetry functionality itself and could have been triggered otherwise.”

Since a browser's job is to make HTTP requests, a bug in the network stack would almost certainly have been hit in other places. This instance was highly visible, so it was noticed quickly, but it's quite possible that a less frequent trigger could have plagued Firefox users for a much longer period of time as HTTP/3 adoption increases.

The article specifically states that normal web requests went through a different code path that did not trigger the bug. That the bug was not technically in the telemetry code is irrelevant: it happened without user interaction because of telemetry, and it did not happen (at least not as often) with telemetry disabled. Saying that there was no way to prevent it assumes that telemetry could not have been disabled or removed, which is false.

The article provides the correct logic: Telemetry was the first to use that combination of new code, but there's no reason to believe that nothing else would ever have used the stack they've been transitioning towards. Had this bug not been found in Telemetry, it would have shown up somewhere else, possibly in a form that was harder to diagnose.