by jabl 2663 days ago
> The other pieces -- not really sure what you're getting at.

What I'm getting at is setting up clusters larger than what you can fit behind a single switch. So you'll want e.g. a Clos fabric with multipathing (the typical IB setup, FWIW). As TRILL and SPB seem pretty dead, the momentum seems to be toward doing the multipathing at the L3 level, using the aforementioned EVPN+VXLAN+BGP, or something similar.


You really don't need EVPN+VXLAN though. (And if you do need it, I recommend finding a way to not need it.)

You mean you have a separate subnet for each leaf switch, and then BGP or similar for multipath routing between the leaves and spines? Sure, but what about subnet-level services like DHCP & PXE? Sounds cumbersome if you have to replicate those across all your leaf switches?

Or maybe you could do one "provisioning and admin" VLAN that spans the entire cluster and uses spanning tree, and then the high-performance RDMA stuff uses the per-leaf VLANs and L3 multipath routing? Is that simpler and better performing than EVPN + VXLAN?

What is the routing latency on such BGP setups, BTW? I find it hard to imagine you can get even close to Ethernet (not to mention IB!) L2 latencies? Or can the fast paths be done in hardware (or FPGAs)?

Yes, a subnet per rack is a best practice. Often people run DHCP & PXE over the 1G out-of-band network, which is dumb L2.
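To make the subnet-per-rack idea concrete, here is a minimal sketch using Python's standard `ipaddress` module. The supernet `10.0.0.0/16`, the `/24` per-rack size, the rack names, and the helper `per_rack_subnets` are all assumptions picked for illustration, not anything from an actual deployment; using the first host address as the leaf's gateway is likewise just a common convention.

```python
import ipaddress


def per_rack_subnets(supernet, new_prefix, racks):
    """Carve one subnet per rack out of a supernet (hypothetical helper).

    Returns a dict mapping rack name -> ip_network, assigned in order.
    Raises ValueError if the supernet cannot hold that many subnets.
    """
    candidates = ipaddress.ip_network(supernet).subnets(new_prefix=new_prefix)
    plan = {}
    for rack in racks:
        try:
            plan[rack] = next(candidates)
        except StopIteration:
            raise ValueError(
                f"{supernet} has no /{new_prefix} left for {rack}"
            ) from None
    return plan


# Assumed example values: three racks, each getting its own /24.
plan = per_rack_subnets("10.0.0.0/16", 24, ["rack01", "rack02", "rack03"])
for rack, subnet in plan.items():
    # Assumption: the leaf switch takes the first usable address as gateway.
    gateway = next(subnet.hosts())
    print(rack, subnet, "gateway", gateway)
```

Each leaf would then advertise only its own prefix to the spines, so the routing table grows with the number of racks rather than the number of hosts.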

In most ASICs everything has the same latency, since packets go through the whole pipeline whether they use all the functionality or not. Anyway, the latency of plain routing would have to be equal to or lower than that of VXLAN encap + routing.