The big wins in this article, in what I believe was the order of impact:

* They do raw packet reassembly using gopacket, and gopacket keeps TCP reassembly buffers that can grow without bound when you miss a TCP segment. They capped the buffers, and the huge 5G spikes went away.
* They were reading whole buffers into memory before handing them off to YAML and JSON parsers. They passed readers instead.
* They were using a protobuf diffing library that used `reflect` under the hood, which allocates. They generated their own explicit object inspection thingies.
* They stopped compiling regexps on the fly and moved the regexps to package variables. (I actually don't know if this was a significant win; there might just be the three big wins.)

This is a great article. But none of these seem Go-specific†, or even GC-specific. They're doing something really ambitious (slurping packets up off the wire against busy API servers, reassembling them in userland into streams, and then parsing the contents of the streams). Memory usage was going to be fiddly no matter what they built with. The problems they ran up against seem pretty textbook. Frankly I'm surprised Go acquitted itself as well as it did here.

† Maybe the perils of `reflect` count as a Go thing; it's worth noting that there's folk wisdom in Go-land to avoid `reflect` when possible.
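For anyone who hasn't seen the reader and regexp fixes in practice, here is a minimal Go sketch of the general pattern, not the article's actual code. The `event` type, the `tokenRe` pattern, and the function names are all made up for illustration; the only real APIs used are `json.NewDecoder` and `regexp.MustCompile` from the standard library.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"regexp"
	"strings"
)

// tokenRe is compiled once at package initialization instead of on every call.
// The pattern itself is a hypothetical example.
var tokenRe = regexp.MustCompile(`^[a-z]+-\d+$`)

// event is a hypothetical payload type used only for this sketch.
type event struct {
	ID string `json:"id"`
}

// countMatching decodes a stream of JSON objects from r one value at a time
// and counts those whose ID matches tokenRe. The caller hands over a reader,
// so the full input never has to be read into a single byte slice first.
func countMatching(r io.Reader) (int, error) {
	dec := json.NewDecoder(r)
	n := 0
	for {
		var e event
		err := dec.Decode(&e)
		if err == io.EOF {
			return n, nil
		}
		if err != nil {
			return n, err
		}
		if tokenRe.MatchString(e.ID) {
			n++
		}
	}
}

// sample runs countMatching over a small in-memory stream.
func sample() (int, error) {
	in := strings.NewReader(`{"id":"req-1"} {"id":"BAD"} {"id":"req-22"}`)
	return countMatching(in)
}

func main() {
	n, err := sample()
	fmt.Println(n, err) // prints "2 <nil>"
}
```

The decoder still buffers internally, but only enough to parse the next value, which is the point of passing a reader rather than a fully loaded buffer.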
Buffering is a pretty common bad habit. As programmers, we know stuff is going to go wrong, and we don't want to tell the user "come back later" (or in this case, undercount TCP stream metrics)... we want to save the data and automatically process it when we can so they don't have to. Unfortunately, it's an intrinsic Law Of The Universe that if data comes in at X bytes per second and leaves at X-k bytes per second, then eventually you will use all storage space in the Universe for your buffer, and then you have the same problem you started with. (Storage limits in mirror may be closer than they appear.) Getting it into your mind that you have to apply back pressure when the system is out of its design specification is pretty crucial. Monitor it, alert on it, fix it, but don't assume that X more bytes of RAM will solve your problem -- there will eventually be a bigger event that exceeds those bounds.
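A rough Go sketch of what "bounded buffer plus a metric" can look like, assuming a single producer. The `boundedQueue` type and its method names are hypothetical; the technique is just a buffered channel with a non-blocking send, so a full buffer turns into a counted drop instead of unbounded growth.

```go
package main

import "fmt"

// boundedQueue is a fixed-capacity buffer. When it is full, new items are
// refused and counted rather than stored. Hypothetical type for illustration.
type boundedQueue struct {
	items   chan []byte
	dropped int // not goroutine-safe; use sync/atomic with multiple producers
}

func newBoundedQueue(capacity int) *boundedQueue {
	return &boundedQueue{items: make(chan []byte, capacity)}
}

// offer enqueues b without blocking. It returns false and bumps the drop
// counter when the buffer is already full.
func (q *boundedQueue) offer(b []byte) bool {
	select {
	case q.items <- b:
		return true
	default:
		q.dropped++
		return false
	}
}

// demo pushes 5 items into a queue of capacity 3 with nobody consuming.
func demo() (accepted, dropped int) {
	q := newBoundedQueue(3)
	for i := 0; i < 5; i++ {
		if q.offer([]byte{byte(i)}) {
			accepted++
		}
	}
	return accepted, q.dropped
}

func main() {
	fmt.Println(demo()) // prints "3 2"
}
```

The `dropped` counter is the thing to export and alert on; a steadily rising value says the consumer is slower than the producer, which no amount of extra capacity fixes.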
Incidentally, the reason why you can make Zoom calls and use SSH while you download a file is because people added software to your networking stack that drops packets even though buffer space in your consumer-grade router is available. That tells your download to chill out so SSH and video conferencing packets get a chance to be sent to the network. The people that made the router had one focus -- get the highest possible Speedtest score. Throughput, unfortunately, comes at the cost of latency (up to buffer size / bandwidth of extra delay for every single packet stuck behind a full buffer!), and it's not the right decision overall.
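To put a number on the latency cost: the worst-case queuing delay of a full buffer is roughly its size divided by the link rate. A small Go sketch of that arithmetic follows; the function name and the 1 MB / 10 Mbit/s figures are made-up examples, not measurements of any real router.

```go
package main

import (
	"fmt"
	"time"
)

// queueDelay returns how long a packet waits behind a completely full buffer
// of bufferBytes on a link that drains at linkBitsPerSec.
// Integer math is used, so very large buffers (around 1 GB and up) would
// overflow int64; that is fine for a back-of-the-envelope sketch.
func queueDelay(bufferBytes, linkBitsPerSec int64) time.Duration {
	return time.Duration(bufferBytes * 8 * int64(time.Second) / linkBitsPerSec)
}

func main() {
	// Hypothetical numbers: 1 MB of buffer on a 10 Mbit/s uplink.
	fmt.Println(queueDelay(1_000_000, 10_000_000)) // prints "800ms"
}
```

Under those assumed numbers, an SSH keystroke queued behind a bulk download waits most of a second, which is why dropping early beats buffering more.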
I don't know where I was going with this rant but ... when your system is overloaded, apply back pressure to whoever is sending you work. A packet monitoring system can't do that (people wouldn't accept "monitoring is overloaded, stop the main process"), but it does have to give up at some point. If you don't have any more memory to reassemble TCP connections, mark the stream as an error and give up. If you're dumping HTTP requests into a database, and the database stops responding, you'll just have to tell the HTTP client at the other end "too many requests" or "temporarily unavailable". To make the system more reliable, keep an eye on those error metrics and do work to get them down. Don't just add some buffers and cross your fingers; you'll just increase latency and still be paged to fight some fire when an upstream system gets slow ;)
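As a sketch of the "tell the HTTP client temporarily unavailable" option in Go: a wrapper that allows a fixed number of in-flight requests and refuses the rest right away with a 503. The names `shed` and `probe` are hypothetical, and the `Retry-After` value of 1 second is an arbitrary choice; `probe` uses `net/http/httptest` only so the example runs without opening a socket.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// shed wraps next so that at most limit requests run at once. Anything beyond
// that is refused immediately with 503 instead of being queued.
func shed(limit int, next http.Handler) http.Handler {
	slots := make(chan struct{}, limit)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case slots <- struct{}{}: // got a slot
			defer func() { <-slots }()
			next.ServeHTTP(w, r)
		default: // no capacity left: push back on the client
			w.Header().Set("Retry-After", "1")
			http.Error(w, "temporarily unavailable", http.StatusServiceUnavailable)
		}
	})
}

// probe sends one in-memory request through a limiter of the given size and
// returns the HTTP status code the client would see.
func probe(limit int) int {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	rec := httptest.NewRecorder()
	shed(limit, ok).ServeHTTP(rec, httptest.NewRequest("GET", "/", nil))
	return rec.Code
}

func main() {
	// A limit of 0 stands in for "every slot is already busy".
	fmt.Println(probe(1), probe(0)) // prints "200 503"
}
```

The count of 503s served is the error metric worth watching; the work is in getting that number down, not in raising the limit until the machine falls over.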
Edit to add: I have a few stories here. One of them is about memory limits, which I always put on any production service I run. sum(memory limits) < sum(memory installed in the machine), of course. One time I had Prometheus running in a k8s cluster, with no memory limit. Sometimes people would run queries that took a lot of RAM, and there was often slack space on the machine, so nothing bad happened. Then someone's mouse driver went crazy, and they opened the same Grafana tab thousands of times. On a high memory query. Obviously, Prometheus used as much RAM as it could, and Linux started OOM killing everything. Prometheus died, was rescheduled on a healthy node, and the next group of tabs killed it. Eventually, the OOM killer had killed the Kubelet on every node, and no further progress could be made. The moral of the story is that it would have been better to serve that user 1000 "sorry, Prometheus died horribly and we can't serve your request right now" errors, which memory limits would have achieved. Instead, we used up all the RAM in the Universe to try to satisfy them, and still failed. What was the resolution? I think we killed the bad browser, which happened to be a dashboard-displaying TV next to our desks. Then the kubelets restarted, and I of course updated Prometheus to have a 4G memory limit. I retried 1000 tabs with an expensive query, and Prometheus died and the frontend proxy served 990 of the tabs an error message. Back pressure! It works! You can imagine how fun this story would have been if I had cluster autoscaling, though. I would have eventually come back to a $1,000,000 AWS bill and a 1000 node Kubernetes cluster ;)