I built several large enterprise products over WebSockets, and I didn't find it that bad. Office networks that either blocked or killed WebSockets were annoying. For some customers that made WebSockets a non-starter in the early 2010s, but by 2016 or so this seemed to be resolved.

Avoiding a thundering herd on reconnect is a well-explored problem and wasn't too bad. We would see mass TCP issues from time to time as well, but they were pretty much no-ops, as they would just trigger a timeout and a reconnect the next time the user performed an operation. We would send an ACK back instantly (prior to execution) for any client-requested operation, so if we didn't see the ACK within a fairly tight window, the client could proactively reap the WebSocket and try again. Customers didn't have to wait long to learn whether a connection was actually alive or merely unclosed.

> If you use WebSockets, you must make reconnects be completely free in the common case

I agree with this, or at least "close to completely free." But in a normal web application you also need to make latency and failed requests "close to completely free," or your application will die along with the network. This is the point I make in my sibling comment: I think distributed state management is a hard problem, but WebSockets are just a layer on top of that, not a solution to or a cause of the problem.

> you must employ people who are willing to become deeply knowledgeable in how TCP works.

I think this is true insofar as you probably want a TCP expert somewhere in your organization to start with, but we never found this particularly complicated. Understanding that the connection isn't trustworthy (that is, when it says it's open, that doesn't mean it works) is the only important fundamental most engineers need in order to work with WebSockets.
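To make the ACK scheme described above concrete, here is a minimal sketch of the client-side bookkeeping. It is not the actual implementation from those products: `AckTracker`, its method names, and the 2-second window are all hypothetical, chosen only for illustration (the comment says only that the window was "fairly tight").

```python
import time
from typing import Callable, Dict, List


class AckTracker:
    """Tracks operations awaiting an ACK (hypothetical helper, not a real API).

    The socket's "open" state is never trusted on its own: the connection is
    treated as suspect as soon as any ACK is overdue.
    """

    def __init__(self, ack_timeout_s: float = 2.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ack_timeout_s = ack_timeout_s      # assumed "tight window"
        self._clock = clock                     # injectable for testing
        self._pending: Dict[str, float] = {}    # op_id -> ACK deadline

    def sent(self, op_id: str) -> None:
        """Record that an operation was written to the socket."""
        self._pending[op_id] = self._clock() + self.ack_timeout_s

    def acked(self, op_id: str) -> None:
        """Record the server's immediate (pre-execution) ACK."""
        self._pending.pop(op_id, None)

    def connection_suspect(self) -> bool:
        """True if any pending operation has passed its ACK deadline."""
        now = self._clock()
        return any(deadline <= now for deadline in self._pending.values())

    def drain_unacked(self) -> List[str]:
        """Return the unacknowledged op ids (in send order) and forget them,
        so the caller can close the socket, reconnect, and resend."""
        ops = list(self._pending)
        self._pending.clear()
        return ops
```

A client would call `sent()` when writing an operation, `acked()` when the ACK arrives, and poll `connection_suspect()` on a short timer; when it returns `True`, close the socket, reconnect, and resend whatever `drain_unacked()` returns. Resending assumes operations are idempotent or deduplicated by id on the server, since a lost ACK does not prove the operation itself was lost.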
Can you please share approaches to mitigate this issue?
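
Assuming "this issue" means the thundering herd on reconnect, one widely used mitigation is capped exponential backoff with random jitter, so that clients dropped at the same moment do not all come back at the same moment. The sketch below is a generic version of that technique, not necessarily what the parent commenter used; `reconnect_delay` and its default values are hypothetical.

```python
import random
from typing import Optional


def reconnect_delay(attempt: int, base_s: float = 0.5, cap_s: float = 30.0,
                    rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before reconnect attempt `attempt` (0-based).

    The ceiling doubles with each attempt up to `cap_s`, and the actual delay
    is drawn uniformly from [0, ceiling] so clients spread out over time.
    """
    if rng is None:
        rng = random.Random()
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    exponent = min(max(attempt, 0), 32)
    ceiling = min(cap_s, base_s * (2 ** exponent))
    return rng.uniform(0.0, ceiling)
```

The client sleeps for `reconnect_delay(attempt)` before each retry and resets `attempt` to zero after a successful connection. The defaults here (0.5 s base, 30 s cap) are placeholders to tune against your own traffic.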