by grw_ 13 days ago
Yeah, thunderbolt-net is IP on top and it does work as you say, with a few caveats:

- On a single cable with two rails available, thunderbolt-net grabs one and uses that. Without patching the kernel, there's no way to make it present a second interface using the remaining pair.

- If you had a second cable between the machines (for 4 total rails), thunderbolt-net would still only grab one rail, because the abstraction it builds the links on sees an identical peer at the end of both links and so falls into the same trap as above. There is no LRO/GRO anyway (or it's buggy, I forget) on the Linux version.

- Why you only get 10G rather than 20G on a single pair: actually, this might be something specific to the Strix Halo SoC that I was testing on. On a different (still AMD) chipset and an Apple TB5 Mac I did see closer to 22G in one direction, but still 8G in the other. The Strix Halo NHI seems to be 'stripped down' (as expected, for mobile) in ways I don't really understand.

- Intuition on why- I can't point you to the line number, but I think it has to do with a fixed 4kb page size when communicating with the NHI that ends up becoming a bottleneck, perhaps 16kb pages on aarch64 apple help here?
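To put rough numbers on the page-size intuition, here is a back-of-the-envelope sketch. It assumes (and this is only an assumption for illustration, not something checked against the driver) that each buffer handed to the NHI carries at most one page of payload, so the number of buffers that must be processed per second scales inversely with page size. The function names are made up for this example.

```python
def frames_per_second(throughput_gbit: float, frame_bytes: int) -> float:
    """Buffers per second needed to sustain a throughput, assuming one
    buffer carries at most frame_bytes of payload (illustrative assumption)."""
    bytes_per_second = throughput_gbit * 1e9 / 8
    return bytes_per_second / frame_bytes


def per_frame_budget_us(throughput_gbit: float, frame_bytes: int) -> float:
    """Time available to handle one buffer, in microseconds."""
    return 1e6 / frames_per_second(throughput_gbit, frame_bytes)


if __name__ == "__main__":
    for page in (4096, 16384):
        for gbit in (10, 20):
            print(
                f"{page:>6} B buffers at {gbit} Gbit/s: "
                f"{frames_per_second(gbit, page):,.0f} buffers/s, "
                f"{per_frame_budget_us(gbit, page):.2f} us each"
            )
```

Under that assumption, 16 KB buffers need a quarter as many buffers per second as 4 KB ones for the same throughput. Whether per-buffer overhead is actually the limiting cost here is exactly the open question.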

1 comments

Ugh, yeah, gross for `thunderbolt-net` to only support one link in total, though presumably fixable.

> - Intuition on why- I can't point you to the line number, but I think it has to do with a fixed 4kb page size when communicating with the NHI that ends up becoming a bottleneck, perhaps 16kb pages on aarch64 apple help here?

I'm used to page size making a difference (due to TLB pressure) but not a factor of 2. I'm not familiar with DMA, so maybe there's some reason it'd be that dramatic there, but I'm unsure.

If the total buffer size is just so small relative to the latency of draining it that it frequently fills and stalls, or if the sender and receiver can't be accessing it at once (but I don't think that should be true?), it might make more sense. If I wanted to make this thing go more smoothly, I'd probably start by measuring the fraction of time the tx/rx buffers are completely empty and completely full.
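A minimal sketch of that measurement, assuming you can somehow get periodic samples of ring occupancy (number of in-flight entries). How to collect those samples is not covered here; the sample list, the ring size of 8, and the names below are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class OccupancyStats:
    empty_fraction: float  # share of samples with nothing queued
    full_fraction: float   # share of samples with the ring completely full


def occupancy_fractions(samples: Iterable[int], ring_size: int) -> OccupancyStats:
    """Summarize how often a ring was observed empty or full.

    samples: occupancy readings (entries in use) taken at regular intervals.
    ring_size: total number of entries in the ring.
    """
    readings = list(samples)
    if not readings:
        raise ValueError("need at least one sample")
    empty = sum(1 for s in readings if s == 0)
    full = sum(1 for s in readings if s >= ring_size)
    return OccupancyStats(empty / len(readings), full / len(readings))


if __name__ == "__main__":
    # Made-up readings for a hypothetical 8-entry ring.
    stats = occupancy_fractions([0, 0, 4, 8, 8, 8, 3, 5], ring_size=8)
    print(stats)
```

A tx ring that is mostly full points at the consumer (the hardware or the peer) being the slow side, while one that is mostly empty points at the producer not filling it fast enough; the reverse reading applies to rx.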

Actually, I'm not sure I understand the text "we only have a single DMA ring for tx and rx" either. Does that mean one for tx and one for rx? Or really one ring in total? If the latter, does it have to, say, drain fully before switching modes? That would seem pretty crippling.