by posnet 790 days ago
The biggest current limitation with cloud providers when it comes to exchange tech is the lack of real multicast support. It is rare outside of exchanges, but extremely low latency L1 multicast market data has become the backbone of exchanges, both for fairness and for scalability.

Knowing you can saturate your entire network with 10G traffic and every participant will get the same market data packets at the same time[0], with zero queuing or bottlenecks, is a guarantee that is very hard to get otherwise. There is a pretty good podcast episode about it out of Jane Street[1].

I know AWS have 'multicast support', but last time I tested it, it was clearly just unicast traffic with a software switch doing fan-out/copying. I assume it uses the same tech as their Transit Gateway; I think it was called Hyperplane or something.

[0]: for some definition of the same time, at least low enough that you can't measure it without equidistant optical splitters or White Rabbit synced devices.

[1]: https://signalsandthreads.com/multicast-and-the-markets/

7 comments

> Knowing you can saturate your entire network with 10G traffic and every participant will get the same market data packets at the same time[0]

Hold on a second. Multicast is nifty, but it does not perform miracles. If you operate a 10G multicast network and actually saturate it, you will experience drops and buffering-induced delays. Perhaps you can play games with time-synchronous networking, but as far as I know the exchanges don’t do this, and it likely needs special hardware.

The point of 10G multicast is to use simple, standard (but complex to configure!) equipment to distribute much less than 10Gbps simultaneously.

Generally speaking, the way multicast is used in trading situations[1] is that you have two networks. On the primary network you do most of your normal IP traffic between applications etc. Then you have a separate marketdata network that carries most of the multicast traffic and is used exclusively for marketdata. Marketdata is generally delivered on an "as fast as possible" basis[2], so you don't care too much about occasional drops, although fewer is obviously better.

[1] At least in my time in the front office.

[2] For example, a very common low-level pattern for a marketdata subscription: when you subscribe to marketdata for some symbol, the system keeps a double buffer where it writes into one slot and you read from the other, and every time you read, it switches the slots around. This means you can accept marketdata as fast as it arrives, process it when you can, and always get the most recent packet when you ask for the next one.
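A minimal sketch of that double-buffer idea, assuming a single feed-handler thread and a single consumer. The class and method names (`LatestValueSlot`, `publish`, `take`) are made up for illustration, and a real feed handler would likely use a lock-free structure rather than a mutex:

```python
import threading
from typing import Optional


class LatestValueSlot:
    """Two-slot buffer: the feed handler overwrites one slot, the reader takes the other."""

    def __init__(self) -> None:
        self._slots = [None, None]
        self._write_index = 0        # slot the feed handler currently writes into
        self._fresh = False          # True if the write slot holds an unread update
        self._lock = threading.Lock()

    def publish(self, update: bytes) -> None:
        # Overwrite the write slot; an older unread update is simply discarded.
        with self._lock:
            self._slots[self._write_index] = update
            self._fresh = True

    def take(self) -> Optional[bytes]:
        # Swap the slots and return the most recent update, or None if nothing new arrived.
        with self._lock:
            if not self._fresh:
                return None
            read_index = self._write_index
            self._write_index = 1 - self._write_index
            self._fresh = False
            return self._slots[read_index]
```

If the consumer is slow, intermediate updates are dropped on purpose: publishing three updates and then calling `take()` once returns only the third.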

I’ve seen some multicast market data protocols with remarkably poor ability to detect or recover from drops. And they are very much not of the form where a newer datagram supersedes the older one.
Yes, although people have been doing marketdata networks the way I said above using IP multicast for at least 20 years now, so in general they choose protocols and network architectures carefully to minimise problems. You do see problems from time to time but they are somewhat rare. Some of the restrictions are interesting. For example IP multicast was basically completely banned on the trading floor where I worked except for marketdata, because of an IP multicast snafu from some random application that took out the whole network once.

One thing to realise about marketdata specifically is that it's really different from other low-latency situations most people are familiar with (netcode in a game, for example). As I mentioned before, it's generally not that big a deal to miss a few packets; the thing that is a big deal is making decisions based on stale data. So you're not generally trying to reconstruct the full state after a drop; you just want the freshest current packet as fast as possible. If you do need to reconstruct state, you can make specific requests for it.

Out of curiosity, what specific protocol are you talking about? I’m very familiar with a couple of these protocols, and the issue has nothing to do with wanting to know the history of the data. The issue is that they use what is, in effect, an ad-hoc delta encoding, and you can’t reliably reconstruct the complete current state if you’re missing a packet. On top of this, sometimes the packet sequence numbering is designed creatively, to be polite about it, so you don’t necessarily find out that you missed a packet as soon as you would like to be able to.
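To illustrate why a missed packet is fatal for a delta-encoded feed, here is a rough sketch. The message layout `(sequence_number, price_level, quantity_change)` and the `DeltaBook` class are hypothetical, not taken from any real exchange protocol, and the sketch assumes clean per-packet sequence numbers, which is exactly what the protocols described above do not always provide:

```python
from typing import Dict, Optional, Tuple

# Hypothetical message layout: (sequence_number, price_level, quantity_change).
Delta = Tuple[int, int, int]


class DeltaBook:
    """Applies quantity deltas per price level; valid only while sequence numbers are contiguous."""

    def __init__(self) -> None:
        self.levels: Dict[int, int] = {}
        self.expected_seq: Optional[int] = None
        self.stale = False  # set once a gap is seen; the state can no longer be trusted

    def apply(self, delta: Delta) -> bool:
        """Apply one delta. Returns False once the book has gone stale."""
        seq, price, change = delta
        # For simplicity, any unexpected sequence number (gap, duplicate, reorder) marks the book stale.
        if self.expected_seq is not None and seq != self.expected_seq:
            self.stale = True
        self.expected_seq = seq + 1
        if self.stale:
            return False
        self.levels[price] = self.levels.get(price, 0) + change
        if self.levels[price] == 0:
            del self.levels[price]
        return True
```

Once `stale` is set, later deltas cannot repair the book, because each one only makes sense relative to a state the receiver never saw. Recovery needs a full snapshot or a retransmission of the missing packets.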
I'm curious if you know what, at a switch level, would actually cause drops and buffering for a 1:N (near-) saturated multicast flow. If all the packets are coming from the same source machine at (perhaps) 9.9Gbps and flowing into the switch, I would expect the switch to robustly redirect all that data with near-zero latency or packet drops to all its output ports. I don't think 10G Ethernet has "backpressure" in a way that would allow some output ports to get slowed down.

If there are other data flows also going through the switch, that could obviously change things, and the sending computer could drop packets if there's jitter in how quickly the application produces them, but it seems impossible for the sending computer to burst packets into the switch any faster than it can handle because all the incoming packets are coming over the same 10G link.

Not an expert here, legitimately curious.

A modern cut-through switch typically has one ASIC for a group of ports, and that ASIC handles all the traffic. If the traffic for all of those ports is greater than what the ASIC can handle, you'll have buffering and/or drops.

That being said, the ASIC can typically handle line rate on all the ports. You could have 10G input and fan it out to 10G output on all the ports with no drops, but if there is other cross port traffic, something could get dropped.

The nature of this kind of traffic is that it's pretty bursty. Think 100x-200x the normal packet rate in the span of a millisecond. Perfect opportunity for drops. Ultra low latency switches have tiny buffers.
Periodic background traffic like DHCP and other background noise can cause packet loss.

You can’t run a queue at 100% and have any expectations of latency. In fact the rule of thumb from queueing theory is 50% to avoid latency spikes.
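As a rough illustration of that rule of thumb, here is the textbook M/M/1 result for mean time in system, W = S / (1 - rho), where S is the mean service time and rho is utilization. Treating a switch port as an M/M/1 queue (Poisson arrivals, exponential service times) is an assumption made for the sketch, not a claim about how real switches behave:

```python
def mm1_time_in_system(utilization: float, service_time: float = 1.0) -> float:
    """Mean time in system (waiting plus service) for an M/M/1 queue."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return service_time / (1.0 - utilization)


if __name__ == "__main__":
    # Show how quickly delay grows as utilization approaches 1.
    for rho in (0.5, 0.8, 0.9, 0.99):
        multiple = mm1_time_in_system(rho)
        print(f"utilization {rho:.2f}: about {multiple:.1f}x the service time")
```

Under this model, 50% utilization already doubles the time a packet spends at the port, 90% makes it ten times the service time, and the delay grows without bound as utilization approaches 100%.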

I mean it's not that hard to eliminate all other traffic on a closed network like that, at least where there's millions of dollars at stake.

Must be nice to open Wireshark and see _nothing_.

That would be highly uninteresting to the rest of us, don’t you think?
People get weird about the word “saturate”. I think GP is switching to “max sustainable” and expecting everyone else to come along for the ride.

Queuing theory has many many bad things to say about actual saturation.

add "carrier sense multiple access" packet creation
Are people still using 10G in PROD? I thought 40G and 100G had generally replaced that. I have 10G cards in my homelab that are a decade old and cost less than $100.
We are 400G to the machines and 800G on the spines.
Cool! Even faster than I realized. What application(s) can actually push 400G through a machine though?
Data transfer for training. It goes directly to the GPUs via RoCE. That said, we also stack the boxes with a bunch of NVMe, so you can cache there first if you want, so that you don't have to worry about network bandwidth as much. We're flexible on customers' needs.

When 800G NICs come out next year (along with PCIe 6), we will start buying those as well so that it is 800G everywhere. Let's see how long that lasts before it is considered slow... heh.

10G serialisation delay is lower than 40G or 100G.

Most markets can disseminate their feeds on 10G effectively. This isn’t true of the major US exchanges.

Almost all exchanges disseminate market data over multicast these days. If you miss a tick it doesn't matter, because by the time a TCP retransmission completes the data is old and useless.
I ran into this problem a while back working at a company that was working to distribute video streams with low latency (lower than Low-Latency HLS) to a large number of viewers. Initially a prototype was built on top of AWS with fan-out/copying and it was terrible. This was partially due to inefficiency, but also due to each link being a reliable stream, meaning dropped packets were re-broadcast even though that isn't really useful to live video.

Moving to our own multicast hardware not only greatly improved performance, but also greatly simplified the design of the system. We required specialized expertise, but the overall project was reasonably straightforward. The biggest issue was that we now had a really efficient packet-machine-gun which we could accidentally point at ourselves or, worse, which a malicious attacker could point at a target.

This 1-N behavior of multicast is both a benefit and a significant risk. I really think there is opportunity for cloud providers to step in and provide a packaged solution which mitigates the downsides (i.e. makes it very difficult to misconfigure where the packet-machine-gun is pointing). My guess is that this hasn't happened yet because there aren't enough use-cases for this to be a priority (the aforementioned video use case might be better served by a more specialized offering), but exchanges could be a really interesting market for such a product.

It would be pretty efficient to multicast market state in an unreliable way, and have an out-of-band fallback mechanism to "fill in" gaps where packets are dropped (potentially distributed, i.e. asking your neighbors if they got that packet).
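A minimal sketch of that idea, assuming every datagram carries a sequence number. `GapFillingReceiver` and the `request_fill` callback are hypothetical names; the callback stands in for whatever out-of-band channel is used (a TCP request to the publisher, a query to a neighbor, etc.):

```python
from typing import Callable, Dict, List, Set


class GapFillingReceiver:
    """Delivers packets in sequence order, requesting missing ones out of band."""

    def __init__(self, first_seq: int, request_fill: Callable[[int], None]) -> None:
        self.next_seq = first_seq             # next sequence number to deliver
        self.pending: Dict[int, bytes] = {}   # packets received ahead of next_seq
        self.requested: Set[int] = set()      # sequence numbers already asked for
        self.request_fill = request_fill
        self.delivered: List[bytes] = []

    def on_packet(self, seq: int, payload: bytes) -> None:
        """Handle a packet from either the multicast feed or the fill channel."""
        if seq < self.next_seq or seq in self.pending:
            return  # duplicate, already handled
        self.pending[seq] = payload
        # Ask once for every sequence number we are still missing below this one.
        for missing in range(self.next_seq, seq):
            if missing not in self.pending and missing not in self.requested:
                self.requested.add(missing)
                self.request_fill(missing)
        # Deliver the contiguous run that is now available.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.requested.discard(self.next_seq)
            self.next_seq += 1
```

The multicast path stays fire-and-forget; only the receivers that actually lost a packet generate extra traffic, and only for the packets they lost.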

In AWS you don't even do neighbour discovery through ARP. Or that's a lie: you do, and you get an ARP reply, but it's not from any of your devices on the network. And traffic is authenticated and authorized at both the source and destination, so you can't do fun things like manipulating ARP tables. You get a lot of nice features when you have a fully software-defined network, but it comes with a couple of caveats, like you mentioned here. I doubt we'll ever see "real multicast support" in the sense that network engineers are used to.
Unless you build your network for it, multicast is a huge pain in the ass to administer. None of the big cloud providers built for it at the scale that traders use it, and I think they prefer things that way. When customers want it, they all just fake it by doing fan-out unicast.
All hyperscalers have an SDN that essentially spoofs local ARP/DHCP inside the hypervisor and does not support broadcast or multicast by design (there are some caveats here, since some telco protocols that require them can be made to work).
> lack of real multicast support

Yup, this is a problem for us in GCP today even outside of trading. I don't know how Pub/Sub works for them.

Pub/sub systems in unicast-only environments are very complex distributed systems to handle the load involved in fan-out routing while maintaining a global order. I had an interviewer once get annoyed with me for suggesting using multicast to solve the fan-out part of a pub/sub system, which made the global ordering part small and simple.

We lost a lot by thinking of HTTP as the one true level of network abstraction.

A reliable multicast network that preserves global order even during maintenance and doesn’t drop packets is not something you will find off the shelf.

A reliable multi-tenant multicast network also appears to be a rare beast. I’ve only heard of it in finance, and that’s only because it’s private and expensive and all the participants need to be generally nice to each other because it’s a repeated game and the operator can literally pull the plug if the rules are broken.

There was a time in the 2000s when a lot of server hardware had 3 NICs. You could use those for redundancy, but a better use was to create three networks: inbound, service and database calls, and administrative.

You had more control over your services talking to each other and control plane tech, thus could make some more guarantees than with inbound data. Don’t cross the streams.

Do you regularly allow untrusted machines onto your private pub/sub instances? I'm not sure the "operator pulling the plug" part is unique to the finance industry.

Also, yeah, you have to do some engineering around your multicast distribution to make a pub/sub system, but multicast pretty much solves the data rate scaling problem - you are now basically O(1) in the number of connected subscribers.
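For anyone who hasn't used it, this is roughly what the O(1) sender side looks like with plain UDP sockets. The group address `239.1.1.1` and port `5007` are placeholders, and this only works on a network that actually forwards IP multicast (which, per the rest of the thread, most cloud VPCs don't):

```python
import socket
import struct

GROUP = "239.1.1.1"  # placeholder administratively scoped group address
PORT = 5007          # placeholder port


def membership_request(group: str, interface: str = "0.0.0.0") -> bytes:
    """Build the ip_mreq structure: group address followed by local interface address."""
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(interface))


def open_receiver(group: str = GROUP, port: int = PORT) -> socket.socket:
    """Bind a UDP socket and join the multicast group on the default interface."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership_request(group))
    return sock


def publish(payload: bytes, group: str = GROUP, port: int = PORT) -> int:
    """Send one datagram to the group; the call is the same for 1 or 10,000 subscribers."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP) as sock:
        # TTL of 1 keeps the datagram on the local network segment.
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
        return sock.sendto(payload, (group, port))
```

The publisher never learns how many receivers called `open_receiver`; the replication happens in the network, which is why the sender's cost does not grow with the subscriber count.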

Here is a research paper I recently wrote about a fair and scalable multicast in the cloud: https://arxiv.org/abs/2402.09527

I would love some feedback!

I find it sad that equal access between the entities doing HFT and regular Joes is not required for fairness, but god forbid one HFT having some millisecond advantage over another. That would be unfair. Can't have that.
Because average Joes don't do algorithmic trading, and if they do, it's not at the level that HFT does. Not even all the big financial players care about HFT and millisecond timing, so they're in the same boat.
What do you think an Average Joe is going to do with that extra millisecond available to them?
I think Average Joe has quite a lot of disadvantages compared to some hedge fund or whatever it is that does HFT. Do you really think the only difference between them is a millisecond? That's only between HFT traders.

My point was that other advantages/disadvantages are not being cared about, not that we should provide millisecond access to Average Joe.

Of course the Average Joe has disadvantages compared to a hedge fund or people who are experts in a field and spend the bulk of their life dedicated to some aspect that the Average Joe is not dedicated towards.

If Average Joe wants returns comparable to these hedge funds, then they should stop trying to time the market and instead stick to diversified ETFs and stop worrying about millisecond differences in the stock market.

Believe it or not, if Average Joe does that they can actually beat most hedge funds over a long time horizon [1].

[1]: https://www.cnbc.com/2018/02/16/warren-buffett-won-2-point-2...

I mean, you could also write "of course whoever has servers closer to the exchange can do HFT better". "Of course companies that invested a lot in having servers closer will reap the advantages."

No, expertise is not the difference. If you're a private person with 100 years experience in trading, you still can't do HFT. You need to be an institution, have lots of capital to invest in servers, software development, maintenance etc. As a private person I think you don't even get access to the API.

"Of course whoever has more capital has advantages in the market"? Of course they do, but I don't think "of course they should".

For this discussion, that funds don't beat the market average over a long term is irrelevant. Why not say "who cares you get more latency than the other bank, if you want money just invest in S&P500 and long term you'll beat them". But you don't apply that to banks against banks, only to banks against Joe. Why?

>of course whoever has servers closer to the exchange can do HFT better

Can do better at what? Can get their trade in the order book faster? Yes they can. But does that automatically mean they will make more money? No it does not.

>If you're a private person with 100 years experience in trading, you still can't do HFT.

Of course not, 100 years ago there were no computers. Having 100 years of experience trading in the pit would not give you any expertise in software development.

Someone with 10 thousand years of experience plowing can't compete against someone with a tractor. That's kind of the point of the tractor...

I'm sorry that it disturbs you that the Average Joe sitting at home with his discount online brokerage account is unable to gain the same kind of benefits putting out individual orders here and there on speculative stocks that he likely knows nothing about, that hedge funds, institutions, and other highly specialized and skilled professionals are able to gain by doing this for a living.

The Average Joe does have access to highly diversified and low fee ETFs, and as I said the Average Joe can reap almost all of the rewards that the best hedge funds and banks do by sticking to those instead of trying to play the market.

Regular Joe is so bad at trading stocks that hedge funds literally give him a discount on the national best bid/offer for the privilege of being allowed to trade with him.

Giving retail traders access to the "actual market" would most likely result in worse execution on average, according to some studies.

It seems you're confused about how competition works among HFT shops. There is no regulated, equal latency for us. We compete for faster access just like everyone else.
Likewise, banks chronologically rearrange the transactions in checking accounts to maximize overdraft fees. Yet when I suggest batching and chronologically randomizing the transactions on exchanges to reduce the benefits of low latency / centrality, people behave as though I have transgressed against Moloch.
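For concreteness, the batching idea could look something like this. It is only a sketch: the `(arrival_time_seconds, order_id)` layout, the function name and the batch length are all made up, and a real proposal would also need matching rules, which are left out here:

```python
import random
from typing import Dict, Iterable, List, Tuple

# Hypothetical order record: (arrival_time_seconds, order_id).
Order = Tuple[float, str]


def batch_and_shuffle(orders: Iterable[Order], batch_seconds: float,
                      rng: random.Random) -> List[List[str]]:
    """Group orders into fixed time windows, then randomize the order within each window."""
    batches: Dict[int, List[str]] = {}
    for arrival, order_id in orders:
        window = int(arrival // batch_seconds)
        batches.setdefault(window, []).append(order_id)
    result = []
    for window in sorted(batches):
        ids = batches[window]
        rng.shuffle(ids)  # arrival order inside a window no longer matters
        result.append(ids)
    return result
```

With a 1 ms window, an order arriving 100 microseconds into the window has no priority over one arriving 400 microseconds in; only landing in an earlier window helps.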
I think it's just beautiful how you made this point and other people couldn't even understand the concept of equal access to data and trading platforms. Yes, that's your point exactly.