by bri3d 1639 days ago
I built several large enterprise products over WebSockets. I didn't find it that bad.

Office networks that either blocked or killed WebSockets were annoying. For some customers they were a non-starter in the early 2010s, but by 2016 or so this seemed to be resolved.

Avoiding a thundering herd on reconnect is a well-explored problem and wasn't too bad.

We would see mass TCP issues from time to time as well, but they were pretty much no-ops: they would just trigger a timeout and a reconnect the next time the user performed an operation. We would send an ACK back instantly (prior to execution) for any client-requested operation, so if the client didn't see the ACK within a fairly tight window, it could proactively reap the WebSocket and try again. Customers didn't have to wait long to learn that a connection was dead even though it had never been closed.
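
To make the early-ACK idea concrete, here is a minimal sketch of the client-side bookkeeping. It is not the code from the products described above: the `AckTracker` name, the operation ids, and the 2-second window are all assumptions made for illustration (the comment only says the window was "fairly tight").

```python
import time
from typing import Callable, Dict, List


class AckTracker:
    """Tracks operations that have been sent but not yet ACKed (hypothetical helper)."""

    def __init__(self, ack_window_s: float = 2.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ack_window_s = ack_window_s      # assumed value, not from the thread
        self._clock = clock                   # injectable so the logic is testable
        self._pending: Dict[str, float] = {}  # op_id -> time the operation was sent

    def sent(self, op_id: str) -> None:
        """Record that an operation was written to the socket."""
        self._pending[op_id] = self._clock()

    def acked(self, op_id: str) -> None:
        """Record the server's early ACK (sent before it executes the operation)."""
        self._pending.pop(op_id, None)

    def overdue(self) -> List[str]:
        """Operations whose ACK has not arrived within the window."""
        now = self._clock()
        return [op for op, sent_at in self._pending.items()
                if now - sent_at > self.ack_window_s]

    def connection_suspect(self) -> bool:
        """True if the socket should be treated as dead, whatever its reported state."""
        return bool(self.overdue())
```

A client loop would check `connection_suspect()` periodically and, when it returns true, close the socket, reconnect, and resend whatever `overdue()` lists.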

> If you use WebSockets, you must make reconnects be completely free in the common case

I agree with this, or at least "close to completely free." But in a normal web application you also need to make latency and failed requests "close to completely free," or your application will die along with the network. This is the point I make in my sibling comment: I think distributed state management is a hard problem, but WebSockets are just a layer on top of that, neither a solution to the problem nor its cause.

> you must employ people who are willing to become deeply knowledgeable in how TCP works.

I think this is true insofar as you probably want a TCP expert somewhere in your organization to start with, but we never found this particularly complicated. Understanding that the connection isn't trustworthy (that is, when it says it's open, that doesn't mean it works) is the only important fundamental for most engineers to be able to work with WebSockets.

2 comments

> Avoiding a thundering herd on reconnect is a well-explored problem and wasn't too bad.

Can you please share approaches to mitigate this issue?

As rakoo said, exponential backoff mitigates the thundering herd. I was going to say add some jitter to the time before reconnecting, then I realized rakoo already said "after a random short time", which is exactly what jitter is. (edited for coffee kicking in)

Congestion avoidance algorithms such as TCP Reno and TCP Vegas: basically, code clients to back off if they detect a situation where they may be part of a thundering herd.

Exponential backoff. Basically, try to reconnect after a random short time; if that doesn't work, try again after twice as long, then twice as long again, and so on.

Usually you want each "2x" wait to be a random time between 1.5x and 2x the previous one, or something like that.
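
For anyone who wants to see the backoff suggestions above in code, a rough sketch follows. The function name, the 1-second initial wait, and the 60-second cap are assumptions for illustration; the thread only specifies "a random short time" followed by a randomized 1.5x to 2x growth factor.

```python
import random
from typing import Callable, Optional


def next_delay(previous_s: Optional[float],
               initial_s: float = 1.0,   # assumed "short" first wait
               cap_s: float = 60.0,      # assumed upper bound, not from the thread
               rand: Callable[[], float] = random.random) -> float:
    """Return how long to wait before the next reconnect attempt.

    Pass None for the first attempt, then feed each result back in.
    """
    if previous_s is None:
        # First retry: a random short time in [0.5 * initial_s, initial_s).
        return initial_s * (0.5 + 0.5 * rand())
    # Later retries: grow by a random factor in [1.5, 2.0) so that clients
    # which dropped at the same moment drift apart instead of retrying together.
    factor = 1.5 + 0.5 * rand()
    return min(cap_s, previous_s * factor)
```

A reconnect loop would call `next_delay(None)` after the first failure, sleep for the result, pass that value back in after each further failure, and start again from `None` once a connection succeeds.
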
> Office networks that either blocked or killed WebSockets were annoying

Curious: how did they detect WS usage? Were you running on plain HTTP, or did they just kill any long-lived TCP connection? Root certs?

No, we always ran on TLS. There were a few classes of these:

* Filtering MITM application firewall solutions which installed a new trusted root CA on employee machines and inspected the raw traffic. These would usually be configured to kill the connection wholesale when they saw an Upgrade request, because the filtering solutions couldn't understand the traffic format, so WebSockets were considered a security risk.

* Old-school HTTP proxy-based systems which would blow up when a CONNECT tunnel was kept alive for very long.

* Firewalls which killed long-lived TCP connections purely at the TCP level. The worst cases were those where there was a mismatch somewhere and we never got a FIN. But again, because we had a rapid expectation for an acknowledgement, we could detect and reap these pretty quickly.

We also tried running WebSockets on a different port for a while, which was not a good idea as many organizations only allowed 443.

> But again, because we had a rapid expectation for an acknowledgement, we could detect and reap these pretty quickly.

I found the best way to handle this was with an application level heartbeat. That bypassed dealing with any weirdness of the client firewalls, TCP spoofing, etc.

Something like ping every 30 seconds and say goodbye to the socket if we don't receive 2 seems to work reasonably well.
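
A rough sketch of that heartbeat rule, with the networking left out. The class and method names are made up for illustration, and it assumes "don't receive 2" means two consecutive pings going unanswered, which is one reading of the above.

```python
class HeartbeatMonitor:
    """Decides when to ping and when to give up on a socket (hypothetical helper)."""

    def __init__(self, interval_s: float = 30.0, max_missed: int = 2) -> None:
        self.interval_s = interval_s  # how often the caller should invoke on_tick()
        self.max_missed = max_missed  # consecutive unanswered pings tolerated
        self._missed = 0
        self._awaiting_reply = False

    def on_tick(self) -> str:
        """Call once per interval. Returns "ping" to send one, or "close" to give up."""
        if self._awaiting_reply:
            # The previous ping was never answered.
            self._missed += 1
            if self._missed >= self.max_missed:
                return "close"
        self._awaiting_reply = True
        return "ping"

    def on_pong(self) -> None:
        """Call whenever the peer answers a ping."""
        self._awaiting_reply = False
        self._missed = 0
```

The caller runs `on_tick()` on a 30-second timer, sends an application-level ping message when it returns `"ping"`, and closes the socket and reconnects when it returns `"close"`.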

It also prevents most idle-timeout-based TCP disconnects from happening.

And even if some network is dumb enough to kill connections in under 30s, it is a non-issue, as that network wouldn't even be usable by normal means. (How do you download any big file if it always disconnects instantly?)