| HN Mirror

> So what did you do to get the most performance out of the system? pin the threads to specific CPUs?

No, they were already pinning to cores. Kernel was also in low-latency mode.

The project they handed to me was just the data plane, written in Rust. They also gave me a partial reference implementation of the control plane, meant for research purposes rather than performance. I had to add a lot of missing features to get it up to par with the supposedly industry-standard benchmark, which didn't exactly match spec so I had to reverse-engineer. Then I had to mate it with the data plane. (They originally suggested I build a control plane from scratch in Rust, but the lack of an ASN1 codegen lib for it made this infeasible within the time I had, considering also that I had 0 systems experience or familiarity with the protocols.) I don't remember all the optimizations, but the ones that still come to mind:

1. Fixing all their memory leaks, obviously. Kinda hard cause they were in C code auto-generated by a Python script. There was even a code comment // TODO free memory after use.

2. Improving the ASN1 en/decoding to take advantage of memcpy in some places. This is because asn1 has different alignment modes, some byte-aligned and some bit-aligned. The control plane used byte-aligned, but the standard ASN1.c lib understandably assumed bit-aligned in either case for simplicity's sake since it worked either way. So I added an optimized path for byte-aligned that used memcpy instead of inspecting each byte for the end markers. This was in a tight loop and made the biggest difference; basically every string copy got faster. The relevant function even knew it was in byte-aligned mode, so it was a simple fix once I figured it out; I tried to make a PR to improve this for everyone else, but forget why I couldn't.

3. Playing with different arrangements of passing messages between threads and different ways of locking. I forget all the ones we tried. Using "parking lot" locks instead of the default ones in the Rust portion helped, also more optimistic locks in other places instead of mutexes, I forget where and why. Since then I've come across the general concept of optimistic vs pessimistic locking a lot as something that makes or breaks performance, particularly in systems that handle money.

4. As I said, playing with the number of threads for each different setup in #3.

5. Playing with NIC settings. We were using Intel's DPDK library and optimized NIC drivers.

6. Making a custom `malloc` implementation that used a memory pool, was thread-scoped, and was optimized for repeated small allocs/deallocs specifically for a portion of the reused code that had a weird and inefficient pattern of memory access. I got it to be faster than the built-in malloc, BUT it was still break-even with DPDK's custom pooled malloc, so I gave up.

7. Branch hints. Tbh didn't make a big difference, even though this was pre Meltdown/Spectre.

8. Simplifying the telemetry. Idk if this helped performance, more of a rant... It's good enough to have some counters that you printf every 60s or something, then parse the logs with a Python script. That's very non-prone to bugs, anyone can understand it, and you can easily tell there's no significant impact on performance. It's overkill in this case to have a protobuf/HTTP client sending metrics to a custom sidecar process, complicating your builds, possibly impacting performance, and leaving no simple paper trail from each test. I respected the previous guy's engineering skills more than my own, but once I found a bug in that code, I took it out.