by jrockway 1728 days ago
Agree strongly here. These are common sources of memory leaks in any language, and it's very likely that rewriting this code in Rust would lead to the exact same problems. (Other cases on HN, like Discord's in-memory cache and Twitch's "memory ballast" thing, are pretty Go-specific -- the identical C program wouldn't have those particular bugs. But the Go developers read these incident reports and do fix the underlying causes; I think Twitch's need for the "memory ballast" got fixed a few years ago, but well after the "don't use Go for that" meme was popularized.)

Buffering is a pretty common bad habit. As programmers, we know stuff is going to go wrong, and we don't want to tell the user "come back later" (or in this case, undercount TCP stream metrics)... we want to save the data and automatically process it when we can so they don't have to. But, unfortunately, it's an intrinsic Law Of The Universe that if data comes in at X bytes per second, and leaves at X-k bytes per second, then eventually you will use all storage space in the Universe for your buffer, and then you have the same problem you started with. (Storage limits in mirror may be closer than they appear.) Getting it into your mind that you have to apply back pressure when the system is out of its design specification is pretty crucial. Monitor it, alert on it, fix it, but don't assume that X more bytes of RAM will solve your problem -- there will eventually be a bigger event that exceeds those bounds.
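To put numbers on the X versus X-k argument, here is a minimal sketch. The function name, the rates, and the 500-byte cap are all made up for illustration; it only does the bookkeeping for an unbounded buffer next to a bounded one that rejects whatever doesn't fit.

```python
def simulate_backlog(in_rate, out_rate, seconds, capacity=None):
    """Return (backlog_bytes, rejected_bytes) after `seconds` one-second ticks.

    capacity=None models an unbounded buffer. An integer models a bounded
    buffer that rejects whatever does not fit (a crude form of back pressure).
    All names and numbers here are hypothetical.
    """
    backlog = 0
    rejected = 0
    for _ in range(seconds):
        backlog += in_rate                      # data arrives
        if capacity is not None and backlog > capacity:
            rejected += backlog - capacity      # overflow is refused, not stored
            backlog = capacity
        backlog -= min(backlog, out_rate)       # data drains
    return backlog, rejected


# Arrivals exceed departures by k = 10 bytes/s for one hour.
print(simulate_backlog(in_rate=100, out_rate=90, seconds=3600))
print(simulate_backlog(in_rate=100, out_rate=90, seconds=3600, capacity=500))
```

The unbounded run ends with a backlog of k * seconds and keeps growing for as long as the overload lasts. The bounded run holds memory flat and turns the overload into a rejected-bytes counter, which is something you can monitor and alert on.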

Incidentally, the reason why you can make Zoom calls and use SSH while you download a file is because people added software to your networking stack that drops packets even though buffer space in your consumer-grade router is available. That tells your download to chill out so SSH and video conferencing packets get a chance to be sent to the network. The people that made the router had one focus -- get the highest possible Speedtest score. Throughput, unfortunately, comes at the cost of latency (up to buffer size / bandwidth of queueing delay for every single packet!), and it's not the right decision overall.
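For a feel of the latency cost: a packet that arrives behind a full FIFO buffer waits roughly the buffer size divided by the link rate. A quick sketch of that arithmetic follows; the 1 MB buffer and 10 Mbit/s uplink are hypothetical numbers, not measurements of any particular router.

```python
def queue_delay_seconds(buffer_bytes, link_bits_per_second):
    """Worst-case wait behind a full FIFO buffer: bits queued / link rate."""
    return buffer_bytes * 8 / link_bits_per_second


# Hypothetical: 1 MB of buffered download traffic on a 10 Mbit/s uplink.
delay = queue_delay_seconds(1_000_000, 10_000_000)
print(f"{delay:.2f} s of added latency")  # 0.80 s
```

Most of a second of extra delay on every keystroke or video frame is why dropping packets early, with buffer space to spare, ends up being the friendlier choice.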

I don't know where I was going with this rant but ... when your system is overloaded, apply backpressure to the consumers. A packet monitoring system can't do that (people wouldn't accept "monitoring is overloaded, stop the main process"), but it does have to give up at some point. If you don't have any more memory to reassemble TCP connections, mark the stream as an error and give up. If you're dumping HTTP requests into a database, and the database stops responding, you'll just have to tell the HTTP client at the other end "too many requests" or "temporarily unavailable". To make the system more reliable, keep an eye on those error metrics and do work to get them down. Don't just add some buffers and cross your fingers; you'll just increase latency and still be paged to fight some fire when an upstream system gets slow ;)
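A minimal sketch of the "give up and tell the client" approach, using only the standard library. `BoundedIntake`, `Overloaded`, and `handle_request` are hypothetical names for illustration, not anyone's real API; the 202/503 return values stand in for whatever your HTTP framework would actually send.

```python
import queue


class Overloaded(Exception):
    """Raised when the pending-work queue is full and the request is shed."""


class BoundedIntake:
    def __init__(self, max_pending):
        # A fixed-size queue: once it is full, new work is refused, not buffered.
        self._pending = queue.Queue(maxsize=max_pending)
        self.rejected = 0  # the error metric to watch and alert on

    def submit(self, item):
        try:
            self._pending.put_nowait(item)
        except queue.Full:
            self.rejected += 1
            raise Overloaded("temporarily unavailable") from None

    def take(self):
        # Called by whatever writes to the (possibly slow) database.
        return self._pending.get_nowait()


def handle_request(intake, body):
    """Return an HTTP-style status: 202 if queued, 503 if shed."""
    try:
        intake.submit(body)
        return 202
    except Overloaded:
        return 503


intake = BoundedIntake(max_pending=2)
print([handle_request(intake, n) for n in range(3)])  # [202, 202, 503]
```

The caller finds out immediately that the system is overloaded, memory use stays bounded, and `rejected` is the number you try to drive down.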

Edit to add: I have a few stories here. One of them is about memory limits, which I always put on any production service I run. sum(memory limits) < sum(memory installed in the machine), of course. One time I had Prometheus running in a k8s cluster, with no memory limit. Sometimes people would run queries that took a lot of RAM, and there was often slack space on the machine, so nothing bad happened. Then someone's mouse driver went crazy, and they opened the same Grafana tab thousands of times. On a high memory query. Obviously, Prometheus used as much RAM as it could, and Linux started OOM killing everything. Prometheus died, was rescheduled on a healthy node, and the next group of tabs killed it. Eventually, the OOM killer had killed the Kubelet on every node, and no further progress could be made. The moral of the story is that it would have been better to serve that user 1000 "sorry, Prometheus died horribly and we can't serve your request right now", which memory limits would have achieved. Instead, we used up all the RAM in the Universe to try to satisfy them, and still failed. (What was the resolution? I think we killed the bad browser, which happened to be a dashboard-displaying TV next to our desks. Then kubelets restarted, and I of course updated Prometheus to have a 4G memory limit. Retried 1000 tabs with an expensive query, and Prometheus died and the frontend proxy served 990 of the tabs an error message. Back pressure! It works! You can imagine how fun this story would have been if I had cluster autoscaling, though. Would have just eventually come back to a $1,000,000 AWS bill and a 1000 node Kubernetes cluster ;)


> it's an intrinsic Law Of The Universe that if data comes in at X bytes per second, and leaves at X-k bytes per second, then eventually you will use all storage space in the Universe for your buffer,

This is known as Little's Law. Using Little's Law, you know that if the average time spent in queue is more than the average time it takes for a new entry to be added to the queue, then your queue fills up.

Or in other words, a Little at a time adds up to a lot.

Did Little formulate multiple eponymous laws? Since that does not seem to be the Little's law that I'm familiar with.

Here's a good introduction to Little's Law and associated operational rules derived from it on queues: http://web.eng.ucsd.edu/~massimo/ECE158A/Handouts_files/Litt...

Thanks, but I had already had courses on that. We never associated the condition for stability (λ<μ) with Little's law (L=λW).
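To make the distinction concrete, here is a small sketch that keeps the two statements apart. It assumes a textbook M/M/1 queue (Poisson arrivals, exponential service times, one server), which nobody in this thread specified; `mm1_stats` is a made-up name. The stability check (λ<μ) decides whether a steady state exists at all, and only then does Little's law (L=λW) relate the averages.

```python
def mm1_stats(arrival_rate, service_rate):
    """Steady-state M/M/1 averages, or None when the queue is unstable.

    Assumes an M/M/1 model; the rates are in jobs per unit time.
    """
    if arrival_rate >= service_rate:
        # lambda >= mu: no steady state exists, the backlog grows without bound.
        return None
    w = 1.0 / (service_rate - arrival_rate)  # mean time in system, W
    l = arrival_rate * w                     # Little's law: L = lambda * W
    return {"W": w, "L": l}


print(mm1_stats(8, 10))   # stable: W = 0.5, L = 4.0
print(mm1_stats(10, 9))   # unstable: None
```

The stability condition is the "X in, X-k out" situation from the parent comment; Little's law only says something once that condition holds.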