| I worked for a while on a well-known product that used (and perhaps still uses) WebSockets for its core feature. I very much agree with the bulk of the arguments made in this blog post. In particular, I found this: - Our well-known cloud hosting provider's networks would occasionally (a few times a year) disconnect all long-lived TCP sockets in an availability zone in unison. That is, an incident that had no SLA promise would cause a large swath of our customers to reconnect all at once. - On a smaller scale, but more frequently: office networks of large customers would do the same thing. - Some customers had network equipment that capped the length of time of that a TCP connection could remain open, interfering with the preferred operation - And of course, unless you do not want to upgrade your server software, you must at some point restart your servers (and again, your cloud hosting provider likely has no SLA on the uptime of an individual machine) - As is pointed out in the article, a TCP connection can cease to transmit data even though it has not closed. So attention must be paid to this. If you use WebSockets, you must make reconnects be completely free in the common case and you must employ people who are willing to become deeply knowledgeable in how TCP works. WebSockets can be a tremendously powerful tool to help in making a great product, but in general they are almost always will add more complexity and toil with lower reliability. (edited typos) |
Office networks that either blocked or killed WebSockets were annoying. For some customers they were a non-starter in the early 2010s, but by 2016 or so this seemed to be resolved.
Avoiding thundering herd on reconnect is a very explored problem and wasn't too bad.
We would see mass TCP issues from time to time as well, but they were pretty much no-ops as they would just trigger a timeout and reconnect the next time the user performed an operation. We would send an ACK back instantly (prior to execution) for any client requested operation, so if we didn't see the ACK within a fairly tight window, the client could proactively reap the WebSocket and try again - customers didn't have to wait long to learn a connection was alive and unclosed.
> If you use WebSockets, you must make reconnects be completely free in the common case
I agree with this, or at least "close to completely free." But in a normal web application you also need to make latency and failed requests "close to completely free" as well or your application will also die along with the network. This is the point I make in my sibling comment - I think distributed state management is a hard problem, but WebSockets are just a layer on top of that, not a solution or cause of the problem.
> you must employ people who are willing to become deeply knowledgeable in how TCP works.
I think this is true insofar as you probably want a TCP expert somewhere in your organization to start with, but we never found this particularly complicated. Understanding that the connection isn't trustworthy (that is, when it says it's open, that doesn't mean it works) is the only important fundamental for most engineers to be able to work with WebSockets.