by oautholaf 1638 days ago
I worked for a while on a well-known product that used (and perhaps still uses) WebSockets for its core feature. I very much agree with the bulk of the arguments made in this blog post.

In particular, I found this:

- Our well-known cloud hosting provider's networks would occasionally (a few times a year) disconnect all long-lived TCP sockets in an availability zone in unison. That is, an incident that had no SLA promise would cause a large swath of our customers to reconnect all at once.

- On a smaller scale, but more frequently: office networks of large customers would do the same thing.

- Some customers had network equipment that capped the length of time that a TCP connection could remain open, interfering with the preferred operation.

- And of course, unless you do not want to upgrade your server software, you must at some point restart your servers (and again, your cloud hosting provider likely has no SLA on the uptime of an individual machine)

- As is pointed out in the article, a TCP connection can cease to transmit data even though it has not closed. So attention must be paid to this.

If you use WebSockets, you must make reconnects be completely free in the common case and you must employ people who are willing to become deeply knowledgeable in how TCP works.

WebSockets can be a tremendously powerful tool to help in making a great product, but in general they will almost always add more complexity and toil, with lower reliability.

(edited typos)


I built several large enterprise products over WebSockets. I didn't find it that bad.

Office networks that either blocked or killed WebSockets were annoying. For some customers they were a non-starter in the early 2010s, but by 2016 or so this seemed to be resolved.

Avoiding thundering herd on reconnect is a well-explored problem and wasn't too bad.

We would see mass TCP issues from time to time as well, but they were pretty much no-ops as they would just trigger a timeout and reconnect the next time the user performed an operation. We would send an ACK back instantly (prior to execution) for any client requested operation, so if we didn't see the ACK within a fairly tight window, the client could proactively reap the WebSocket and try again - customers didn't have to wait long to learn a connection was alive and unclosed.
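
A minimal sketch of the ACK-window idea described above, not the actual implementation. The class name `AckTracker`, the 2-second window, and the injectable `clock` are all assumptions made for illustration; the original system's values and structure are not known.

```python
import time
from typing import Callable, Dict


class AckTracker:
    """Tracks operations awaiting an ACK (hypothetical helper, illustration only)."""

    def __init__(self, ack_window: float = 2.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ack_window = ack_window          # assumed value, tune per application
        self.clock = clock                    # injectable so the logic is testable
        self.pending: Dict[str, float] = {}   # operation id -> ACK deadline

    def sent(self, op_id: str) -> None:
        # Record the deadline by which the server's instant ACK must arrive.
        self.pending[op_id] = self.clock() + self.ack_window

    def acked(self, op_id: str) -> None:
        # The server acknowledged receipt (before executing the operation).
        self.pending.pop(op_id, None)

    def should_reap(self) -> bool:
        # True once any operation has waited past its deadline: treat the
        # socket as dead even though it still reports itself as open.
        now = self.clock()
        return any(deadline <= now for deadline in self.pending.values())
```

The client would call `should_reap()` on a short timer and, when it returns true, close the WebSocket and retry the pending operations on a fresh connection.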

> If you use WebSockets, you must make reconnects be completely free in the common case

I agree with this, or at least "close to completely free." But in a normal web application you also need to make latency and failed requests "close to completely free" as well or your application will also die along with the network. This is the point I make in my sibling comment - I think distributed state management is a hard problem, but WebSockets are just a layer on top of that, not a solution or cause of the problem.

> you must employ people who are willing to become deeply knowledgeable in how TCP works.

I think this is true insofar as you probably want a TCP expert somewhere in your organization to start with, but we never found this particularly complicated. Understanding that the connection isn't trustworthy (that is, when it says it's open, that doesn't mean it works) is the only important fundamental for most engineers to be able to work with WebSockets.

> Avoiding thundering herd on reconnect is a well-explored problem and wasn't too bad.

Can you please share approaches to mitigate this issue?

As rakoo said, exponential backoff mitigates the thundering herd. I was going to say add some jitter to the time before reconnecting, then I realized rakoo already said "after a random short time", which is exactly what jitter is. (edited for coffee kicking in)

Congestion avoidance algorithms such as TCP Reno and TCP Vegas. Basically, code clients to back off if they detect a situation where they may be a member of a thundering herd.

Exponential backoff. Basically, try to reconnect after a random short time; if that doesn't work, try again after twice as long, then twice as long again, and so on.

Usually you want the 2x wait to be a random time between 1.5x and 2x longer or something.

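
Putting the two suggestions above together, a rough sketch might look like this. The function name `reconnect_delays`, the 1-second first wait, and the 60-second cap are assumptions for illustration; only the random first wait and the 1.5x to 2x growth come from the comments above.

```python
import random
from typing import Iterator, Optional


def reconnect_delays(first_max: float = 1.0, cap: float = 60.0,
                     rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield the wait in seconds before each successive reconnect attempt."""
    rng = rng or random.Random()
    # First retry happens after a random short time, so clients that were
    # disconnected in unison do not all come back in the same instant.
    delay = rng.uniform(0.1 * first_max, first_max)
    while True:
        yield delay
        # Grow by a random factor between 1.5x and 2x rather than exactly 2x,
        # so clients keep drifting apart on later attempts. The cap is an
        # assumed upper bound, not something stated in the thread.
        delay = min(cap, delay * rng.uniform(1.5, 2.0))
```

A client would take the next value from the generator after each failed attempt and create a fresh generator once a connection succeeds.
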
> Office networks that either blocked or killed WebSockets were annoying

Curious: how did they detect WS usage? Were you running on HTTP or did they just kill any long-lived TCP connection? Root certs?

No, we always ran on TLS. There were a few classes of these:

* Filtering MITM application firewall solutions which installed a new trusted root CA on employee machines and looked at the raw traffic. These would usually be configured to wholesale kill the connection when they saw an UPGRADE because the filtering solutions couldn't understand the traffic format and they were considered a security risk.

* Oldschool HTTP proxy based systems which would blow up when CONNECT was kept alive for very long.

* Firewalls which killed long-lived TCP connections just at the TCP level. The worst here were where there was a mismatch somewhere and we never got a FIN. But again, because we had a rapid expectation for an acknowledgement, we could detect and reap these pretty quickly.

We also tried running WebSockets on a different port for a while, which was not a good idea as many organizations only allowed 443.

> But again, because we had a rapid expectation for an acknowledgement, we could detect and reap these pretty quickly.

I found the best way to handle this was with an application level heartbeat. That bypassed dealing with any weirdness of the client firewalls, TCP spoofing, etc.

Something like a ping every 30 seconds, saying goodbye to the socket if we don't receive 2, seems to work reasonably well.

It also prevents most idle-timeout TCP disconnects from happening.

And even if some network is dumb enough to kill the connection in under 30s, it is a non-issue, as that network won't even be usable by normal means. (How do you download any big file if it always disconnects instantly?)
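
The heartbeat rule above could be sketched like this. `HeartbeatMonitor` and its method names are made up for illustration, and the sketch approximates "missed 2" as two full intervals elapsing since the last reply rather than counting individual pings.

```python
import time
from typing import Callable


class HeartbeatMonitor:
    """Decides when a socket should be dropped (hypothetical helper)."""

    def __init__(self, interval: float = 30.0, max_missed: int = 2,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.interval = interval      # seconds between pings, from the comment above
        self.max_missed = max_missed  # missed replies tolerated before giving up
        self.clock = clock            # injectable so the logic is testable
        self.last_seen = clock()

    def on_pong(self) -> None:
        # Any heartbeat reply proves the path works end to end right now.
        self.last_seen = self.clock()

    def missed(self) -> int:
        # Number of whole ping intervals that passed without a reply.
        return int((self.clock() - self.last_seen) // self.interval)

    def is_dead(self) -> bool:
        return self.missed() >= self.max_missed
```

Whatever loop sends the pings would check `is_dead()` on each tick and close the socket when it returns true.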

> disconnect all long-lived TCP sockets in an availability zone in unison

I don't know what this means, but it sounds ridiculous. This would cause havoc with any sort of persistent tunnel or stateful connection, such as most database clients. Do you perhaps mean this just happens at ingress? That is much more believable and not as big of a deal.

> office networks of large customers would do the same thing.

Sounds like a personal problem. In all seriousness, your clients should handle any sort of network disconnect gracefully. It's foolish to assume TCP connections are durable, or to assume that you won't be hit by a thundering herd.

Maybe I'm old-fashioned, but TCP hasn't changed much over the years and none of these problems are novel to me. It's well-trodden ground, and there are many simple techniques for building durable clients.

Also, all of the things you mention also affect plain old HTTP, especially HTTP2. There shouldn't be a significant difference in how you treat them, other than the fact you cannot assume they're all short lived connections.

Most applications written using HTTP, in my experience, do not have deep dependencies on the longevity of the HTTP2 connection. In my experience, TCP connections for HTTP2 are typically terminated at your load balancer or similar. So reconnections here happen completely unseen by either the client application in the field or the servers where the business logic is.

For us -- and I think this is common -- the persistent WebSocket connection allowed a set of assumptions around the shared state of the client and server that would have to be re-negotiated when reconnecting. The fact that this renegotiation was non-trivial was a major driver in selecting WebSockets in the first place. With HTTP, regardless of HTTP2 or QUIC, your application protocol very much is set up to re-negotiate things on a per-request basis. And so the issues I list don't tend to affect HTTP-based applications.

> the persistent WebSocket connection allowed a set of assumptions around the shared state of the client and server that would have to be re-negotiated when reconnecting. The fact that this renegotiation was non-trivial was a major driver in selecting WebSockets in the first place. With HTTP, regardless of HTTP2 or QUIC, your application protocol very much is set up to re-negotiate things on a per-request basis. And so the issues I list don't tend to affect HTTP-based applications.

I think this describes a poor choice in technology. There's no silver bullet here, and it sounds like you made a lot of questionable tradeoffs. Assuming that "session" state persists beyond the lifetime of either the client or the server is generally problematic. It's always easier for one party to be stateless, but you can become stateful for the duration of the transaction.

Shared state is best used as communications optimization, and maybe sometimes useful for security reasons.

> Assuming that "session" state persists beyond the lifetime of either the client or the server is generally problematic.

I don't think you're interpreting the problem right? The state is tied to the connection, not outliving client or server. But it outlives single requests, and would be uncomfortably expensive to re-establish per request.

What I'm saying is that it's unrealistic to expect to hold a persistent TCP connection for an extended period of time across networking environments you do not control.

Making things not uncomfortably expensive is a good idea.

Relying on websockets to solve this for you is a mistake. It's convenient, but not robust. How would you solve it without websockets using traditional HTTP? The same solution should be used with websockets, which then unlock tremendous opportunities for optimization.

> How would you solve it without websockets using traditional HTTP?

You'd probably do the uncomfortably expensive setup, then give the client a token and store the settings in a database. And then do your best to cache it and have fast paths to reestablish from the cache on the same server or on different servers.

Not only could this add a lot of complication, now you've actually introduced the problem of state outliving your endpoints! You do unlock new ways to optimize, but you pay a high cost to get there. There's a very good chance this rearchitecture is a bad idea.
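
For concreteness, the token-plus-database approach described above might be sketched as follows. Everything here is hypothetical: `SessionStore`, the in-memory dicts standing in for a real database and a per-server cache, and the 16-byte token size are assumptions, not anyone's actual design.

```python
import secrets
from typing import Any, Dict, Optional


class SessionStore:
    """Toy model of token-based session resumption (illustration only)."""

    def __init__(self) -> None:
        self.database: Dict[str, Dict[str, Any]] = {}  # stand-in for durable storage
        self.cache: Dict[str, Dict[str, Any]] = {}     # stand-in for a per-server cache

    def establish(self, settings: Dict[str, Any]) -> str:
        # The uncomfortably expensive setup has already produced `settings`;
        # persist it and hand the client an opaque token.
        token = secrets.token_urlsafe(16)
        self.database[token] = dict(settings)
        self.cache[token] = dict(settings)
        return token

    def resume(self, token: str) -> Optional[Dict[str, Any]]:
        # Fast path: this server still has the state cached.
        if token in self.cache:
            return self.cache[token]
        # Slow path: another server created the session, or the cache was lost.
        settings = self.database.get(token)
        if settings is not None:
            self.cache[token] = settings
        return settings
```

Note that the stored state now outlives both endpoints, which is exactly the cost pointed out above: expiry and invalidation become new problems to solve.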

> Sounds like a personal problem. In all seriousness, your clients should handle any sort of network disconnect gracefully

That can be complex. Corporate MITM filtering boxes, "intrusion detection" appliances, firewalls, etc. can just decide to drop NAT entries, drop packets, break path MTU discovery, etc. Yes, there are things you can do. But then customers restart/reload when things don't happen instantly, etc. I don't know that there's a simple playbook.

None of this is particular to websockets, and in addition:

> you must employ people who are willing to become deeply knowledgeable in how TCP works

You already needed that for your HTTP based application; it's a fundamental of networked computing. Developers skipping out on mechanical sympathy are often duds, in my experience.

> employ people who are willing to become deeply knowledgeable in how TCP works

I used Microsoft's SignalR library. It knows TCP pretty well and handles most of the common pitfalls nearly automatically.

> customers to reconnect all at once.

That is definitely a problem. So we had to code it from the get go with the assumption that either the network will go down or the server will be bounced for an upgrade.

Actually, most of the issues I encountered had to do with various iPad versions going to sleep and then handling WebSockets in different ways once they woke up.

Hosted? How are your costs? I hear this catches people sometimes.

Any other advice for SignalR?

QUIC is the right idea. Encrypt everything including state. Kill middle boxes.

History has shown that if you allow middle boxes they will ruin everything.

> - Our well-known cloud hosting provider's networks would occasionally (a few times a year) disconnect all long-lived TCP sockets in an availability zone in unison. That is, an incident that had no SLA promise would cause a large swath of our customers to reconnect all at once.

I’m kind of surprised that it was that infrequent. I would expect software upgrades should cause long-lived sockets to reset…

or a scale-up of an ELB

> - Some customers had network equipment that capped the length of time that a TCP connection could remain open, interfering with the preferred operation

What's the alternative that's going to work here?