by c0l0 643 days ago
As I had posted a few weeks ago (https://news.ycombinator.com/item?id=41085314), I recently implemented a very similar thing myself.

My solution ended up using tc's mirred[0] action to implement a fully L2-transparent frame relay. I wonder if their setup achieves the same degree of transparency, because afaiui, that's just not possible with an 802.1Q-compliant (Linux) bridge.
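
For readers who have not used mirred before, here is a minimal sketch of what such a relay could look like. It is not the exact setup described above: the interface names `eth0`/`eth1` are placeholders, and the helper `relay_commands` is made up for illustration. It only builds the tc command lines (an ingress qdisc plus a matchall filter with a mirred redirect, in both directions) and leaves running them, as root, to you.

```python
def relay_commands(if_a: str, if_b: str) -> list[str]:
    """Build tc commands that redirect every frame received on one port out of the other.

    Hypothetical helper: it only returns the command strings, it does not run them.
    """
    cmds = []
    for src, dst in ((if_a, if_b), (if_b, if_a)):
        # The ingress qdisc provides a hook for filters on received frames.
        cmds.append(f"tc qdisc add dev {src} handle ffff: ingress")
        # matchall selects every frame; mirred redirects it to the peer's egress.
        cmds.append(
            f"tc filter add dev {src} parent ffff: matchall "
            f"action mirred egress redirect dev {dst}"
        )
    return cmds


if __name__ == "__main__":
    # eth0/eth1 are placeholder interface names.
    for cmd in relay_commands("eth0", "eth1"):
        print(cmd)
```

Because the redirect happens at the tc ingress hook, the frames never reach bridge code, which is what makes this approach attractive when full L2 transparency is the goal.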

I spent close to a week optimizing my setup: looking at kernel flame graphs and perf results, reading adapter-specific tuning guides and driver source. The only really meaningful performance optimizations (in both the Broadwell- and Zen3/Vermeer-based implementations I tried) were disabling mitigations in the kernel (esp. on Zen3, where that boosted performance by more than 20%) and getting CPU frequency scaling/idle states sorted out correctly (which yielded much bigger wins on the older Broadwell uarch, because power state transitions appear to happen much more quickly on Zen3).

As for the solution presented in the (on the whole really great; I love it!) article, I have my doubts about the effectiveness of the cargo-culted "sysctl tuning" mentioned - TCP, for example, is simply not involved at all in the described setup, so "tuning" its buffer allocations cannot have any effect on the workload.

Kudos to the writers for solving their problem in a creative, cost-effective and maintainable way! :)

[0]: https://www.man7.org/linux/man-pages/man8/tc-mirred.8.html

3 comments

> I wonder if their setup achieves the same degree of transparency, because afaiui, that's just not possible with an 802.1Q-compliant (Linux) bridge.

Can you elaborate on what is not transparent about an 802.1Q bridge in Linux?

I hear you on the system tuning. Whenever I change sysctl variables I always include a comment with what the default was and why the new setting is better. I don't trust sysctl copy pasta w/o decent explanations.

There are a number of "special" Ethernet addresses that a proper Ethernet bridge must never forward. The Linux bridge implements a mechanism to ignore _some_ of these constraints, but not all of them. If you need that, you can always resort to manual patching in https://github.com/torvalds/linux/blob/d42f7708e27cc68d080ac... et al.
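
To make the "some, but not all" part concrete, here is a small sketch. It assumes the mechanism meant above is the bridge's `group_fwd_mask` (my reading, not confirmed in this thread), where bit N of the mask covers the reserved address 01:80:C2:00:00:0N and, to my understanding, the kernel refuses to let bits 0 to 2 be set. It models only the mask check, not every special case in the kernel, and the function names are made up for illustration.

```python
RESERVED_PREFIX = bytes.fromhex("0180c20000")  # 01:80:C2:00:00:0X link-local block
RESTRICTED_MASK = 0x0007  # bits 0-2 (STP, MAC PAUSE, LACP) cannot be set in group_fwd_mask


def parse_mac(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' (or dash-separated) into 6 raw bytes."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError(f"not a MAC address: {mac!r}")
    return raw


def is_reserved(mac: str) -> bool:
    """True for the 16 reserved addresses 01:80:C2:00:00:00 .. 0F."""
    raw = parse_mac(mac)
    return raw[:5] == RESERVED_PREFIX and raw[5] <= 0x0F


def mask_allows_forwarding(mac: str, group_fwd_mask: int) -> bool:
    """Simplified model: can this mask make a bridge forward frames sent to `mac`?"""
    if not is_reserved(mac):
        return True  # ordinary addresses are not subject to this check
    bit = 1 << parse_mac(mac)[5]
    if bit & RESTRICTED_MASK:
        return False  # a mask containing these bits is not accepted at all
    return bool(group_fwd_mask & bit)
```

With this model, LLDP (01:80:C2:00:00:0E) becomes forwardable with a mask of 0x4000, while LACP (01:80:C2:00:00:02) stays blocked whatever the mask, which is the kind of gap that only patching (or bypassing the bridge) closes.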
Thank you for your reply. I have had some weird issues with Linux bridges in the past and now I'm wondering if this could have been the culprit.
> include a comment

You may already do this, but in general, please include the Year, Month, Kernel Version and your own Name when doing this.
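
One way to make that habit hard to skip is to generate the entries. The sketch below is only an illustration of the convention discussed here; the helper name `annotated_sysctl` and the comment layout are my own invention, not an established format.

```python
from datetime import date


def annotated_sysctl(key: str, value: str, default: str, reason: str,
                     author: str, kernel: str, when: date) -> str:
    """Render one sysctl.d-style entry preceded by comments documenting the change."""
    return "\n".join([
        f"# {when:%Y-%m}, kernel {kernel}, {author}",  # year-month, kernel version, name
        f"# default: {default}",
        f"# reason: {reason}",
        f"{key} = {value}",
    ])


if __name__ == "__main__":
    # All values below are illustrative placeholders, not tuning advice.
    print(annotated_sysctl(
        key="net.core.netdev_max_backlog",
        value="5000",
        default="1000",
        reason="placeholder: describe the measured problem and the observed effect",
        author="jdoe",
        kernel="6.7",
        when=date(2024, 8, 1),
    ))
```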

Which mitigations did you disable? Only specific ones that you know wouldn't be a risk for what the machines were doing (mostly network, mostly kernel space)?

Like, by disabling the mitigations does that leave the servers slightly more open to someone nefarious finding a way to use some kind of timing attack to get some knowledge of your wireguard keys?

(Genuine question as someone with very little knowledge on both wireguard and *bleed CPU flaws)

No, I actually just booted with 'mitigations=off' and called it a day. We will employ Zen4 cores on the pre-prod setup soon enough, and I'll be looking into the benefit (if any) of disabling mitigations in a more fine-grained manner there.
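
If you want to see what such a boot leaves exposed on a given machine, the kernel reports a per-vulnerability status under `/sys/devices/system/cpu/vulnerabilities/`. A minimal sketch for reading it follows; the function names are made up for illustration, and the exact status strings vary by kernel version and CPU.

```python
from pathlib import Path

VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def read_mitigation_status(directory: Path = VULN_DIR) -> dict[str, str]:
    """Map each vulnerability name to the kernel's one-line status string."""
    status: dict[str, str] = {}
    if not directory.is_dir():
        return status  # not Linux, or sysfs not mounted
    for entry in sorted(directory.iterdir()):
        if entry.is_file():
            status[entry.name] = entry.read_text().strip()
    return status


def still_vulnerable(status: dict[str, str]) -> list[str]:
    """Names whose status line starts with 'Vulnerable'."""
    return [name for name, text in status.items() if text.startswith("Vulnerable")]


if __name__ == "__main__":
    current = read_mitigation_status()
    for name, text in current.items():
        print(f"{name}: {text}")
    print("unmitigated:", still_vulnerable(current))
```

Comparing the output before and after a reboot with a changed command line is a cheap way to confirm which mitigations a given setting actually turned off.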
> CPU frequency scaling/idle states sorted out correctly

please elaborate

To "fix" performance (i.e., increase throughput by close to 35%), one has to adjust the "energy performance bias" on the (Broadwell) platform, e.g. using x86_energy_perf_policy[0] or cpupower[1]. Otherwise, the CPUs/platform firmware will settle on a very unsatisfactory compromise: high-ish power consumption (~90W per socket), yet substantially less performance than with all idle states disabled completely (= CPU in POLL at all times, resulting in ~135W per core). One can tweak things to reach a sweet spot in the middle, with ~99% of the peak performance at a very sensible idle power draw (i.e., ~25W when the link isn't loaded).

With Zen3, this hardly mattered at all.

I also got to witness that using IPv4 for the wireguard "overlay" network yielded about 30% better performance than when using IPv6 with ULA prefixes.

[0]: https://man.archlinux.org/man/x86_energy_perf_policy.8

[1]: https://linux.die.net/man/1/cpupower

tyvm

came across the epp thing once or twice but remained in the land of 'echo performance |tee /sys....'

if you can share anything related to your sweetspot

the v6-performance issue reeks of mtu

> if you can share anything related to your sweetspot

For Broadwell in particular, it is enough to avoid power states lower than C1E, in my experience.
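
As a rough sketch of one way to do that (not necessarily how it was done here), the cpuidle sysfs interface exposes each idle state as a `stateN` directory with `name` and `disable` files. The helper names below are made up for illustration, and the code assumes the driver actually reports a state named `C1E`, which is not the case with every idle driver. Writing to `disable` needs root.

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")


def states_deeper_than(names: list[str], deepest_allowed: str = "C1E") -> list[int]:
    """Given state names ordered shallow to deep, return positions of the deeper ones."""
    if deepest_allowed not in names:
        raise ValueError(f"state {deepest_allowed!r} not reported by this driver")
    cutoff = names.index(deepest_allowed)
    return list(range(cutoff + 1, len(names)))


def disable_deep_states(cpu_root: Path = CPU_ROOT, deepest_allowed: str = "C1E") -> None:
    """Write 1 to the 'disable' file of every idle state deeper than `deepest_allowed`."""
    for cpuidle in sorted(cpu_root.glob("cpu[0-9]*/cpuidle")):
        # Sort numerically so state10 comes after state9.
        state_dirs = sorted(cpuidle.glob("state[0-9]*"), key=lambda p: int(p.name[5:]))
        names = [(d / "name").read_text().strip() for d in state_dirs]
        for idx in states_deeper_than(names, deepest_allowed):
            (state_dirs[idx] / "disable").write_text("1")


if __name__ == "__main__":
    # Pure example on a typical-looking list of names; nothing is written to sysfs.
    print(states_deeper_than(["POLL", "C1", "C1E", "C3", "C6"]))
```

The setting does not survive a reboot, so it would have to be reapplied at boot if it turns out to help.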

And no, MTU plays no part in the degraded IPv6 performance. I think it's rooted in a less efficient route lookup mechanism (Linux 6.7 was what I tested with), but I did not take the time to check properly.