by tptacek 1731 days ago
The big wins in this article, in what I believe was the order of impact:

* They do raw packet reassembly using gopacket, and gopacket keeps TCP reassembly buffers that can grow without bound when you miss a TCP segment. They capped the buffers, and the huge 5 GB memory spikes went away.

* They were reading whole buffers into memory before handing them off to YAML and JSON parsers. They passed readers instead.

* They were using a protobuf diffing library that used `reflect` under the hood, which allocates. They generated their own explicit object inspection thingies.

* They stopped compiling regexps on the fly and moved the regexps to package variables. (I actually don't know if this was a significant win; there might just be the three big wins.)
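
As an illustration of the "pass readers instead" point, here is a minimal sketch of decoding a JSON array one element at a time from an `io.Reader`, rather than reading the whole payload into a byte slice first. The `event` type and the `countEvents` helpers are hypothetical names made up for this example, not code from the article. Note that `json.Decoder` still buffers each individual value, so the saving comes from never holding the whole array at once.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// event is a hypothetical record type, used only for illustration.
type event struct {
	ID   int    `json:"id"`
	Name string `json:"name"`
}

// countEvents decodes a JSON array of events from r one element at a time,
// so only the current element (plus the decoder's internal buffer) is held
// in memory, instead of the entire payload.
func countEvents(r io.Reader) (int, error) {
	dec := json.NewDecoder(r)

	// Consume the opening '[' of the array.
	if _, err := dec.Token(); err != nil {
		return 0, err
	}

	n := 0
	for dec.More() {
		var e event
		if err := dec.Decode(&e); err != nil {
			return n, err
		}
		n++ // a real program would process e here
	}

	// Consume the closing ']'.
	if _, err := dec.Token(); err != nil {
		return n, err
	}
	return n, nil
}

// countEventsInString is a convenience wrapper for the demo below.
func countEventsInString(s string) (int, error) {
	return countEvents(strings.NewReader(s))
}

func main() {
	n, err := countEventsInString(`[{"id":1,"name":"a"},{"id":2,"name":"b"}]`)
	fmt.Println(n, err)
}
```

In a real service the reader would be a network connection or file rather than a string.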

This is a great article. But none of these seem Go-specific†, or even GC-specific. They're doing something really ambitious (slurping packets up off the wire against busy API servers, reassembling them in userland into streams, and then parsing the contents of the streams). Memory usage was going to be fiddly no matter what they built with. The problems they ran up against seem pretty textbook.

Frankly I'm surprised Go acquitted itself as well as it did here.

† Maybe the perils of `reflect` count as a Go thing; it's worth noting that there's folk wisdom in Go-land to avoid `reflect` when possible.


Agree strongly here. These are common sources of memory leaks in any language, and it's very likely that rewriting this code in Rust would lead to the exact same problems. (Other cases on HN, like Discord's in-memory cache and Twitch's "memory ballast" thing, are pretty Go-specific -- the identical C program wouldn't have those particular bugs. But the Go developers read these incident reports and do fix the underlying causes; I think Twitch's need for the "memory ballast" got fixed a few years ago, but well after the "don't use Go for that" meme was popularized.)

Buffering is a pretty common bad habit. As programmers, we know stuff is going to go wrong, and we don't want to tell the user "come back later" (or in this case, undercount TCP stream metrics)... we want to save the data and automatically process it when we can so they don't have to. But, unfortunately, it's an intrinsic Law Of The Universe that if data comes in at X bytes per second, and leaves at X-k bytes per second, then eventually you will use all storage space in the Universe for your buffer, and then you have the same problem you started with. (Storage limits in mirror may be closer than they appear.) Getting it into your mind that you have to apply back pressure when the system is out of its design specification is pretty crucial. Monitor it, alert on it, fix it, but don't assume that X more bytes of RAM will solve your problem -- there will eventually be a bigger event that exceeds those bounds.

Incidentally, the reason why you can make Zoom calls and use SSH while you download a file is because people added software to your networking stack that drops packets even though buffer space in your consumer-grade router is available. That tells your download to chill out so SSH and video conferencing packets get a chance to be sent to the network. The people that made the router had one focus -- get the highest possible Speedtest score. Throughput, unfortunately, comes at the cost of latency (up to buffer size / bandwidth of extra queueing delay for every single packet!), and it's not the right decision overall.

I don't know where I was going with this rant but ... when your system is overloaded, apply backpressure to the consumers. A packet monitoring system can't do that (people wouldn't accept "monitoring is overloaded, stop the main process"), but it does have to give up at some point. If you don't have any more memory to reassemble TCP connections, mark the stream as an error and give up. If you're dumping HTTP requests into a database, and the database stops responding, you'll just have to tell the HTTP client at the other end "too many requests" or "temporarily unavailable". To make the system more reliable, keep an eye on those error metrics and do work to get them down. Don't just add some buffers and cross your fingers; you'll just increase latency and still be paged to fight some fire when an upstream system gets slow ;)
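
A rough sketch of the "give up at some point" idea in Go: a fixed-capacity queue whose `offer` method refuses new work instead of growing, and counts the refusals so they can be monitored. The `boundedQueue` type, `offer`, and `errOverloaded` are hypothetical names for this illustration, and `atomic.Int64` assumes Go 1.19 or later.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// errOverloaded tells the caller to back off. In an HTTP handler this might
// be translated into "too many requests" or "temporarily unavailable".
var errOverloaded = errors.New("queue full: try again later")

// boundedQueue is a hypothetical fixed-capacity work queue.
type boundedQueue struct {
	items   chan string
	dropped atomic.Int64 // number of rejected items; export as a metric
}

func newBoundedQueue(capacity int) *boundedQueue {
	return &boundedQueue{items: make(chan string, capacity)}
}

// offer enqueues item without blocking. When the queue is at capacity it
// rejects the item and records the rejection rather than buffering more.
func (q *boundedQueue) offer(item string) error {
	select {
	case q.items <- item:
		return nil
	default:
		q.dropped.Add(1)
		return errOverloaded
	}
}

func main() {
	q := newBoundedQueue(2)
	for _, item := range []string{"a", "b", "c"} {
		if err := q.offer(item); err != nil {
			fmt.Printf("%s rejected: %v\n", item, err)
		}
	}
	fmt.Println("dropped:", q.dropped.Load())
}
```

The capacity is the design specification made explicit: memory use is bounded by it, and the `dropped` counter is the error metric to watch and drive down.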

Edit to add: I have a few stories here. One of them is about memory limits, which I always put on any production service I run. sum(memory limits) < sum(memory installed in the machine), of course. One time I had Prometheus running in a k8s cluster, with no memory limit. Sometimes people would run queries that took a lot of RAM, and there was often slack space on the machine, so nothing bad happened. Then someone's mouse driver went crazy, and they opened the same Grafana tab thousands of times. On a high memory query. Obviously, Prometheus used as much RAM as it could, and Linux started OOM killing everything. Prometheus died, was rescheduled on a healthy node, and the next group of tabs killed it. Eventually, the OOM killer had killed the Kubelet on every node, and no further progress could be made. The moral of the story is that it would have been better to serve that user 1000 "sorry, Prometheus died horribly and we can't serve your request right now", which memory limits would have achieved. Instead, we used up all the RAM in the Universe to try to satisfy them, and still failed. (What was the resolution? I think we killed the bad browser, which happened to be a dashboard-displaying TV next to our desks. Then kubelets restarted, and I of course updated Prometheus to have a 4G memory limit. Retried 1000 tabs with an expensive query, and Prometheus died and the frontend proxy served 990 of the tabs an error message. Back pressure! It works! You can imagine how fun this story would have been if I had cluster autoscaling, though. Would have just eventually come back to a $1,000,000 AWS bill and a 1000 node Kubernetes cluster ;)

> it's an intrinsic Law Of The Universe that if data comes in at X bytes per second, and leaves at X-k bytes per second, then eventually you will use all storage space in the Universe for your buffer,

This is known as Little's Law. Using Little's Law, you know that if the average time spent in queue is more than the average time it takes for a new entry to be added to the queue, then your queue fills up.

Or in other words, a Little at a time adds up to a lot.
Did Little formulate multiple eponymous laws? Since that does not seem to be the Little's law that I'm familiar with.
Here's a good introduction to Little's Law and associated operational rules derived from it on queues: http://web.eng.ucsd.edu/~massimo/ECE158A/Handouts_files/Litt...
Thanks, but I had already had courses on that. We never associated the condition for stability (λ<μ) with Little's law (L=λW).
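
For reference, the two statements being distinguished in this sub-thread can be written side by side, together with the arithmetic behind the "X in, X-k out" remark. The symbols follow the usual queueing convention (λ arrival rate, μ service rate, W mean time in the system, L mean number in the system); B, the buffer capacity, is a name introduced here purely for illustration, and the growth line assumes steady rates and an initially empty buffer.

```latex
\begin{align*}
L &= \lambda W && \text{(Little's law: mean occupancy of a stable queue)} \\
\lambda &< \mu && \text{(stability condition: arrivals slower than service)} \\
Q(t) &\approx (\lambda - \mu)\,t = k\,t && \text{(backlog growth when } \lambda = X,\ \mu = X - k\text{)} \\
t_{\text{full}} &\approx \frac{B}{k} && \text{(time until a buffer of capacity } B \text{ overflows)}
\end{align*}
```

For example, with k = 1 MB/s and B = 5 GB, the buffer fills in roughly 5,000 seconds, or about 83 minutes; doubling B only doubles that time.
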
> They stopped compiling regexps on the fly and moved the regexps to package variables. (I actually don't know if this was a significant win; there might just be the three big wins.)

Anecdotally, this could be a huge win, depending on how often it's called.

A guy I was working with, new to Go, was writing a router config parser and asked why it was so slow.

The first thing I did was move regexp.Compile from a hot path into a broader scope. It went from something like 40 seconds down to 2 on my machine.

I think it's easy to assume that in this case Go's regex library would keep an internal cache of expressions, using the expression string as a map key. On the other hand, I can see why they haven't implemented it, because it would take memory usage out of the author's direct control.

It would probably be a good idea to add performance hints like 'prefer to put static regular expressions in a package variable' in a linter or go vet.
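
A minimal sketch of the change described above, assuming a made-up router-config line format; `hostnameRE`, `parseHostname`, and `parseHostnameSlow` are hypothetical names. The slow version recompiles the pattern on every call, while the fast one compiles it once into a package variable.

```go
package main

import (
	"fmt"
	"regexp"
)

// hostnameRE is compiled once, at package initialization. MustCompile panics
// on an invalid pattern, which is reasonable for a constant pattern.
var hostnameRE = regexp.MustCompile(`^hostname\s+(\S+)$`)

// parseHostnameSlow recompiles the pattern on every call. On a hot path the
// compilation cost and its allocations are paid for each line parsed.
func parseHostnameSlow(line string) (string, bool) {
	re, err := regexp.Compile(`^hostname\s+(\S+)$`)
	if err != nil {
		return "", false
	}
	m := re.FindStringSubmatch(line)
	if m == nil {
		return "", false
	}
	return m[1], true
}

// parseHostname reuses the precompiled package-level regexp.
func parseHostname(line string) (string, bool) {
	m := hostnameRE.FindStringSubmatch(line)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	for _, line := range []string{"hostname core-rtr-1", "interface eth0"} {
		name, ok := parseHostname(line)
		fmt.Printf("%q -> %q %v\n", line, name, ok)
	}
}
```

A compiled `*regexp.Regexp` is safe for concurrent use by multiple goroutines, so sharing one package-level value is fine.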

Actually I would expect any package not to silently cache things until explicitly specified. This otherwise creates an unbounded memory leak.

Moving static (at least as far as the loop is concerned) expressions out of a loop is one of the most fundamental optimizations a programmer should do when writing code.

> I think it's easy to assume that in this case Go's regex library would keep an internal cache of expressions

IMHO, the stdlib doing implicit memoization is a catastrophe waiting to happen.

I think that handling regexps and caching functions are two composable and orthogonal features that should be handled by two packages/libs/... .

Spring (Boot) works exactly the same. We once found that 30% of CPU time was spent parsing path regexes in Controllers somewhere deep inside Spring. We rewrote 1500 endpoints to use hardcoded paths, and that fixed the CPU usage.
I've seen the same in Python, probably a dozen times. Sometimes folks think it's ugly (un-pythonic) but there's plenty of cases in the standard library to point to.
That's because the Python regex module caches the regexes it compiles, so it only happens once. It's proper and good usage to specify the regex string inline, even in a hot path.

I'd only use a variable when I'm using the same regex multiple times in code, and even then I could still just have the variable be the string.

> That's because the Python regex module caches the regexes it compiles, so it only happens once. It's proper and good usage to specify the regex string inline, even in a hot path.

Last time I had a look the regex cache was pretty small (few hundred entries) and gets completely cleared when full. Might have improved since, but historically it was very simplistic.

I disagree that it’s “proper and good usage” to specify regex inline. It’s fine for many usages but that’s as far as I’d go.

Even then, hashing and lookup is completely unnecessary in a hot path. Having a variable with a compiled regex is not unpythonic AFAIK
Yeah, it's been a while since I've benchmarked that, I'll try it out
Gotta agree with the sibling comments here. The performance difference is definitely smaller than it used to be, but there's still good reasons to keep compiled regexes in a module scope.

Caveat: I write libraries, not "production" code; my requirements are significantly more strict. One thing I can't do is make assumptions about where my code will run. If you're using my library, and you compile a whole bunch of regexes, they'll evict my regexes from the cache. I don't want the performance of my library to suffer, so I'll keep them in the module scope.

>Memory usage was going to be fiddly no matter what they built with.

That is true. I do find however the explicitness of the Rust way of dealing with memory, whether it be lifetimes, who can and can't mutate it and who the memory belongs to, makes it much easier to reason about the right way of doing these things.

In C++ the same is often possible, but there is no way to have guarantees at the interfaces. Const is a promise that your function won't mutate something, it doesn't put any restrictions on the caller. Pass by reference doesn't guarantee that the reference will be kept alive.

Go (I guess with no experience there) probably has fewer footguns, but how explicit is memory management?

Rust and Go are equivalently explicit about the memory concerns surfaced in this analysis.
Yeah, and perhaps someone who knows Rust well could argue some things are easier to do right in Rust. For example, in the second bullet, passing readers could be more of the norm in libraries, since Rust is a systems programming language. The third bullet makes a similar point.

I'm not saying Rust is better or that they made the wrong choice (it sounds like C++ would let users easily make the same "wrong" choices); it's just interesting to carry the thoughts through a bit further.

The io.Reader and io.Writer interfaces are used everywhere in Go; it's really a standard practice.

https://tour.golang.org/methods/21

> They were using a protobuf diffing library that used `reflect` under the hood, which allocates. They generated their own explicit object inspection thingies.

IIRC it is `reflect.Type.FieldXXX` which is the main culprit of allocations. Since the number of types in a typical application is bounded and small, you can get pretty far by just precomputing/caching struct fields.
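
A sketch of what that caching could look like, assuming the expensive part is walking a struct's fields: the field list is computed once per `reflect.Type` and reused afterwards. `fieldCache`, `fieldNames`, `fieldNamesOf`, and the `packet` type are hypothetical names for this example, and no allocation measurements are claimed here.

```go
package main

import (
	"fmt"
	"reflect"
	"sync"
)

// fieldCache maps a reflect.Type to its precomputed []string of field names.
// reflect.Type values are comparable, so they can be used as map keys.
var fieldCache sync.Map

// fieldNames returns the field names of struct type t, walking the type with
// reflection only the first time t is seen. Non-struct types yield nil.
func fieldNames(t reflect.Type) []string {
	if t == nil || t.Kind() != reflect.Struct {
		return nil
	}
	if cached, ok := fieldCache.Load(t); ok {
		return cached.([]string)
	}
	names := make([]string, 0, t.NumField())
	for i := 0; i < t.NumField(); i++ {
		names = append(names, t.Field(i).Name)
	}
	// LoadOrStore keeps the first stored slice if two goroutines race here.
	actual, _ := fieldCache.LoadOrStore(t, names)
	return actual.([]string)
}

// fieldNamesOf is a convenience wrapper that takes a value, not a type.
func fieldNamesOf(v any) []string {
	return fieldNames(reflect.TypeOf(v))
}

// packet is a made-up struct used only for the demo.
type packet struct {
	Src string
	Dst string
	Len int
}

func main() {
	fmt.Println(fieldNamesOf(packet{}))
	fmt.Println(fieldNamesOf(packet{})) // second call is served from the cache
}
```

Callers must treat the returned slice as read-only, since every caller shares the same cached slice.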

Reflection APIs seem to be pretty messy and slow in every runtime I've ever used, perhaps because the idea of optimizing them might encourage more use. The C# reflection APIs also allocate a lot.
A thing you can ding Go for is that you can find yourself relying on `reflect` (under the hood) more than you expect, because it's how you do things like read struct tags for things like JSON.

But that's not what the problem was here; the product they were building was using `reflect` in anger. They were relying on something that did magic, pulling a rabbit out of its hat to automatically compare protobuf thingies. They used it on a hot path. The room quickly filled with rabbit corpses. I guess you can blame Go for the existence of those kinds of libraries, but most perf-sensitive devs know that they're a risk.

Reflection is also typically needed for anything that needs to be generic over types. For example, if you want to write a function that can traverse or transform a map or slice, where the actual types aren't known at compile time. We have a lot of this in our Go code at the company I work for. I'm really looking forward to generics, which will help us rip out a ton of reflect calls.
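
Go has since gained type parameters (in Go 1.18), so the kind of map and slice helper described above can be sketched without `reflect`. `mapSlice` and `mapValues` are hypothetical helper names, not functions from the standard library or from the commenter's codebase.

```go
package main

import "fmt"

// mapSlice applies f to every element of in and returns the results.
// The element types are checked at compile time; no reflection is involved.
func mapSlice[T, U any](in []T, f func(T) U) []U {
	out := make([]U, 0, len(in))
	for _, v := range in {
		out = append(out, f(v))
	}
	return out
}

// mapValues applies f to every value of in, keeping the keys unchanged.
func mapValues[K comparable, V, W any](in map[K]V, f func(V) W) map[K]W {
	out := make(map[K]W, len(in))
	for k, v := range in {
		out[k] = f(v)
	}
	return out
}

func main() {
	lengths := mapSlice([]string{"a", "bb", "ccc"}, func(s string) int { return len(s) })
	fmt.Println(lengths)

	doubled := mapValues(map[string]int{"x": 1, "y": 2}, func(n int) int { return n * 2 })
	fmt.Println(doubled)
}
```

This only covers cases where the types are known at each call site; code that must handle types discovered at run time still needs `reflect`.
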
That kind of code is generally non-idiomatic in Go. An experienced Go programmer looks at something that is generic over types and does something interesting and instinctively asks "what gives, where are the dead rabbits?".

I'm less excited about generics. There's a cognitive cost to them, and the constraint current Go has against writing type-generic code is often very useful, the same way a word count limit is useful when writing a column. It changes the way you write, and often for the better.

This argument is getting a little tiresome though, isn't it? It isn't simply enough to call something "non-idiomatic" to gloss over a deficiency. There's a cognitive cost to all language features, but most other general purpose statically typed programming languages seem to have come to the conclusion that the benefit outweighs the cost for some form of generics.

I am by no means a Go basher, it is one of my favorite languages. But I eagerly await generics.

I could have written this more clearly. The fact that things that are generic over types are non-idiomatic today in Go has nothing to do with whether the upcoming generics feature is good or bad. They're unrelated arguments.

The latter argument is subjective and you might easily disagree. The former argument, about experienced Go programmers being wary when an API is generic over types, is pretty close to an objective fact; it is a true statement about conventional Go code.

I’m so conflicted on that point. I’ve been writing a high performance CRDT in Rust for the last few months, and I’m leaning heavily on generics. For example, one of my types is a special b-tree for RLE data. (So each entry is a simple range of values). The b-tree is used in about 3-4 different contexts, each time with a different type parameter depending on what I need. Without generics I’d need to either duplicate my code or do something simpler (and slower). I can imagine the same library in JavaScript with dynamic types and I think the result would be easier to read. But the resulting executable would run much slower, and the code would be way more error prone. (I couldn’t lean on the compiler to find bugs. Even TS wouldn’t be rich enough.)

Generics definitely make code harder to write and understand. But they can also be load bearing - for compile time error checking, specialisation and optimization. I’m not convinced it’s worth giving that up.

If we can be at a place where reasonable people can disagree about generics, I'm super happy, and think we've moved the discourse forward. There are things I like about generics, particularly in Rust (I've had the displeasure of dealing with them in C++, too). They're just not an unalloyed good thing.
Before writing Clojure, Rich Hickey wrote FOIL[1], which used sockets to communicate between Common Lisp and the JVM (or CLR). When asked about making it in-process, Rich observed that the reflection overhead on the JVM was often as large as, or larger than, the serialization overhead, so the gains to be had were limited.

1: http://foil.sourceforge.net/

From what I recall, the Java team copped to the intentionally slow accusation, but that started to change when they decided to embrace the notion of other languages besides Java running on the JVM. Unfortunately that would have been shortly after Clojure was born. It took a few releases for them to really improve that situation, and that was still shortly before they started doing faster releases.
The usual C# reflection APIs that devs turn to allocate a lot, but there are ways to make them almost performant by (re)using delegates and expressions. There are a number of good libraries to use reflection faster, as well.
IIRC, dynamically compiled expression trees in C# have the overhead of a single virtual call (on the resulting delegate) when executing - and cover all object factory and member access scenarios. But if you need to discover metadata, you still have to resort to Reflection APIs.
The main corner case that still causes me problems is that if you want to construct a delegate at runtime, this often forces you to go through the reflection APIs to actually grab the method even if you know its full signature, etc. My current project has a JIT compiler for scripts that has this problem (I ended up finding a workaround involving getting LINQ to generate method tokens in an assembly, but .NET Core / .NET 5 deprecated LINQ compilation...)
They're pretty simple in many dynamic languages, eg you can just do "import os; dir(os)" in Python.
yeah, but that doesn't mean they're fast or don't make a mess of the gc heap
I hadn't heard people use messy to refer to garbage creation before!

To continue with Python, yes, you might get a new container (dict) allocated like in the above case to hold the already existing interned attribute name strings. It's still quite light since the objects representing the information already exist and are used under the hood in the dynamic typing machinery.

It's a question I often ask in interviews: how do you upload a 5GB file over the network with only 1MB of memory?
I'm not even sure what this question is aiming at. I hope you phrase it in more detail than you did here, or it would fit in those posts about the problems with interview questions :).

Assuming the file is on a disk and the 1 MB refers to the system memory - like you do with any potentially unbounded data, you read and write it in chunks. Reading in data of any kind in whole is only reasonable if you can clearly set an upper bound for its size.
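
A minimal sketch of that chunked approach in Go: data moves from a reader to a writer through one fixed-size buffer, so memory use depends on the chunk size rather than the file size. `copyInChunks` and `copyString` are hypothetical names; the standard library's `io.Copy` does the same job with its own internal buffer.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// copyInChunks streams src to dst through a single buffer of chunkSize bytes
// and returns the number of bytes written. At most chunkSize bytes of payload
// are held in memory at any time, however large src is.
func copyInChunks(dst io.Writer, src io.Reader, chunkSize int) (int64, error) {
	buf := make([]byte, chunkSize)
	var total int64
	for {
		n, rerr := src.Read(buf)
		if n > 0 {
			wn, werr := dst.Write(buf[:n])
			total += int64(wn)
			if werr != nil {
				return total, werr
			}
			if wn < n {
				return total, io.ErrShortWrite
			}
		}
		if rerr == io.EOF {
			return total, nil
		}
		if rerr != nil {
			return total, rerr
		}
	}
}

// copyString is a demo wrapper: it copies s through a small buffer into an
// in-memory builder. A real upload would use an *os.File and a net.Conn.
func copyString(s string, chunkSize int) (string, int64, error) {
	var out strings.Builder
	n, err := copyInChunks(&out, strings.NewReader(s), chunkSize)
	return out.String(), n, err
}

func main() {
	out, n, err := copyString("hello, world", 4)
	fmt.Println(out, n, err)
}
```

With a file as the source and a network connection as the destination, a slow receiver blocks `Write`, which in turn pauses reading, so back pressure comes for free.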

This is not vague; it aims to see if the candidate has a notion of buffering, streaming, etc. The numbers don't really matter. You would be surprised at how many candidates have no idea how to load data into memory.
> with only 1MB of memory

Is that total system memory?

Well I left Kentucky back in '49 and went to Detroit work'n on assembly line..
To such a vague question, the only answer can be "less than 1MB at a time". Not terribly useful.
hello world in Go is 1.9 MB
I am not saying it's bad given what go runtime can do
That's it, in most cases, performance is not down to the language but to how it's used. "Mechanical Sympathy" is one of those terms that might apply here as well.

As to the issues you mentioned, there's a few 'adages' you could apply; "always use readers", "don't use reflect if you can help it", "move unchanging expressions to package level", etc.

Maybe I'm just incompetent but why would you do this?
> Frankly I'm surprised Go acquitted itself as well as it did here.

As opposed to, e.g., Java, which, as I ranted elsewhere in the thread, is a trashy mess. I programmed for over a decade in Java, and yeah, it's only gotten worse over the years. They would have done even more custom processing and bypassing of the layers underneath due to Java's typical copy-happiness.

This kind of analysis and remediation would work just as well in Java and is often a more rigorous and effective approach than the author's somewhat Java-inspired initial idea of fiddling with GC parameters.

One big difference is that the Java runtime design intent is more in the vein of 'converting memory into performance'. On HN, Ron Pressler ('pron) has written a bunch of interesting stuff about that over the years:

https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...

Java the language has improved a lot IME. If you're talking about some specific library, then I don't know.