by amluto 790 days ago
> Knowing you can saturate your entire network with 10G traffic and every participant will get the same market data packets at the same time[0]

Hold on a second. Multicast is nifty, but it does not perform miracles. If you operate a 10G multicast network and actually saturate it, you will experience drops and buffering-induced delays. Perhaps you can play games with time-synchronous networking, but as far as I know the exchanges don’t do this, and it likely needs special hardware.

The point of 10G multicast is to use simple, standard (but complex to configure!) equipment to distribute much less than 10Gbps simultaneously.


Generally speaking, the way multicast is used in trading situations[1] is that you have two networks. On the primary network you run most of your normal IP traffic between applications and so on. Then you have a separate marketdata network that carries most of the multicast traffic and is used exclusively for marketdata. Marketdata is generally delivered on an "as fast as possible" basis[2], so you don't care too much about occasional drops, although fewer is obviously better.

[1] At least in my time in the front office.

[2] For example, a very common low-level pattern for a marketdata subscription: when you subscribe to marketdata for some symbol, the system keeps a double buffer. It writes into one slot while you read from the other, and every time you read it swaps the slots. This means you can generally accept marketdata as fast as it arrives, process it when you can, and always get the most recent packet when you ask for the next one.
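A minimal sketch of the double-buffer pattern described in [2], written in Python for readability. The class and method names (`LatestValueSlot`, `publish`, `take`) and the symbol "XYZ" are made up for illustration, and a lock stands in for whatever lock-free mechanism a real feed handler would more likely use.

```python
import threading


class LatestValueSlot:
    """Two-slot buffer: the writer overwrites one slot, the reader swaps and takes it."""

    def __init__(self):
        self._slots = [None, None]
        self._write_idx = 0            # slot the feed handler currently writes into
        self._lock = threading.Lock()  # assumption: a lock instead of lock-free code

    def publish(self, update):
        # Overwrite the write slot; any older unread update is discarded.
        with self._lock:
            self._slots[self._write_idx] = update

    def take(self):
        # Swap the slots, then return (and clear) the one the writer just filled.
        with self._lock:
            read_idx = self._write_idx
            self._write_idx ^= 1
            update = self._slots[read_idx]
            self._slots[read_idx] = None
            return update


slot = LatestValueSlot()
for price in (100.1, 100.2, 100.3):
    slot.publish(("XYZ", price))  # three updates arrive before the reader wakes up
latest = slot.take()              # the reader sees only the freshest one
empty = slot.take()               # nothing new has arrived since the last read
```

The reader never queues up behind old updates: however many packets arrived between reads, `take` hands back only the last one written.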

I’ve seen some multicast market data protocols with remarkably poor ability to detect or recover from drops. And they are very much not of the form where a newer datagram supersedes the older one.
Yes, although people have been doing marketdata networks the way I said above using IP multicast for at least 20 years now, so in general they choose protocols and network architectures carefully to minimise problems. You do see problems from time to time but they are somewhat rare. Some of the restrictions are interesting. For example IP multicast was basically completely banned on the trading floor where I worked except for marketdata, because of an IP multicast snafu from some random application that took out the whole network once.

One thing to realise about marketdata specifically is that it's really different from other low-latency situations most people are familiar with (netcode in a game, for example). As I mentioned before, it's generally not that big a deal to miss a few packets; the thing that is a big deal is making decisions based on stale data. So you're not generally trying to reconstruct the full state after a drop, you just want the freshest current packet as fast as possible. If and when you need to reconstruct state, you can make specific requests.

Out of curiosity, what specific protocol are you talking about? I’m very familiar with a couple of these protocols, and the issue has nothing to do with wanting to know the history of the data. The issue is that they use what is, in effect, an ad-hoc delta encoding, and you can’t reliably reconstruct the complete current state if you’re missing a packet. On top of this, sometimes the packet sequence numbering is designed creatively, to be polite about it, so you don’t necessarily find out that you missed a packet as soon as you would like to be able to.
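To make the delta-encoding problem concrete, here is a rough sketch under assumed conventions, not any real exchange protocol: every message carries a consecutive integer sequence number, a delta sets the size at one price level (size 0 removes it), and a snapshot resets everything. `DeltaFeedState` and its methods are hypothetical names.

```python
class DeltaFeedState:
    """Tracks one side of an order book built from snapshot + delta messages."""

    def __init__(self):
        self.levels = {}      # price in ticks -> size
        self.next_seq = None  # sequence number expected next
        self.stale = True     # True until a snapshot has been applied

    def on_snapshot(self, seq, levels):
        # A snapshot carries the complete state, so it clears any earlier gap.
        self.levels = dict(levels)
        self.next_seq = seq + 1
        self.stale = False

    def on_delta(self, seq, price, size):
        if self.stale or seq < self.next_seq:
            return            # waiting for a snapshot, or a duplicate/old packet
        if seq > self.next_seq:
            self.stale = True  # gap: later deltas cannot be applied safely
            return
        if size == 0:
            self.levels.pop(price, None)
        else:
            self.levels[price] = size
        self.next_seq = seq + 1


book = DeltaFeedState()
book.on_snapshot(10, {1000: 5})
book.on_delta(11, 1005, 3)
book.on_delta(13, 1000, 0)  # packet 12 was lost, so this delta is not applied
```

After the last call `book.stale` is set and the book stays frozen at its pre-gap contents until a fresh snapshot arrives. Note that the loss of packet 12 is only noticed when packet 13 shows up, which is the best case; with less regular sequence numbering the gap can go unnoticed for longer.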
I'm curious if you know what, at a switch level, would actually cause drops and buffering for a 1:N (near-) saturated multicast flow. If all the packets are coming from the same source machine at (perhaps) 9.9Gbps and flowing into the switch, I would expect the switch to robustly redirect all that data with near-zero latency or packet drops to all its output ports. I don't think 10G Ethernet has "backpressure" in a way that would allow some output ports to get slowed down.

If there are other data flows also going through the switch, that could obviously change things, and the sending computer could drop packets if there's jitter in how quickly the application produces them, but it seems impossible for the sending computer to burst packets into the switch any faster than it can handle because all the incoming packets are coming over the same 10G link.

Not an expert here, legitimately curious.

A modern cut-through switch typically has one ASIC for a group of ports, and that ASIC handles all the traffic. If the traffic for all of those ports is greater than what the ASIC can handle, you'll have buffering and/or drops.

That being said, the ASIC can typically handle line rate on all the ports. You could have 10G input and fan it out to 10G output on all the ports with no drops, but if there is other cross port traffic, something could get dropped.

The nature of this kind of traffic is that it's pretty bursty. Think 100x-200x the normal packet rate in the span of a millisecond. Perfect opportunity for drops. Ultra low latency switches have tiny buffers.
Periodic background traffic like DHCP, and other background noise, can cause packet loss.

You can’t run a queue at 100% and have any expectations of latency. In fact the rule of thumb from queueing theory is 50% to avoid latency spikes.
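As a rough illustration of why utilisation near 100% is a problem, the sketch below uses the textbook M/M/1 queue, where mean time in the system is the service time divided by (1 - utilisation). The M/M/1 model is an assumption made here for the arithmetic; real switch traffic is burstier than Poisson arrivals, so actual behaviour is likely worse. `mm1_delay_factor` is a made-up helper name.

```python
def mm1_delay_factor(utilization):
    """Mean time in an M/M/1 system, as a multiple of the mean service time."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1); at 1.0 the queue never drains")
    return 1.0 / (1.0 - utilization)


for rho in (0.5, 0.9, 0.99):
    factor = mm1_delay_factor(rho)
    print(f"{rho:.0%} utilised: about {factor:.0f}x the service time")
```

Under this model the mean delay is twice the service time at 50% load, ten times at 90%, and grows without bound as the load approaches 100%.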

I mean it's not that hard to eliminate all other traffic on a closed network like that, at least where there's millions of dollars at stake.

Must be nice to open Wireshark and see _nothing_.

That would be highly uninteresting to the rest of us, don’t you think?
People get weird about the word “saturate”. I think GP is switching to “max sustainable” and expecting everyone else to come along for the ride.

Queuing theory has many many bad things to say about actual saturation.

add "carrier sense multiple access" packet creation
Are people still using 10G in PROD? I thought 40G and 100G had generally replaced that. I have 10G cards in my homelab that are a decade old and cost less than $100.
We are 400G to the machines and 800G on the spines.
Cool! Even faster than I realized. What application(s) can actually push 400G through a machine though?
Data transfer for training. It goes directly to the GPUs via RoCE. That said, we also stack the boxes with a bunch of NVMe as well, so you can cache there first, if you want so that you don't have to worry about network bandwidth as much. We're flexible on customers needs.

When 800G NICs come out next year (along with PCIe6), we will start buying those as well so that it is 800G everywhere. Let's see how long that lasts before it is considered slow... heh.

10G serialisation delay is lower than 40G or 100G.

Most markets can disseminate their feeds on 10G effectively. This isn’t true of the major US exchanges.

Almost all exchanges disseminate market data over multicast these days. If you miss a tick it doesn't matter, because by the time a TCP retransmission completes it is old, useless data.