> The final aggregate read throughput reached approximately 6.6 TiB/s with background traffic from training jobs.
The Ceph team has been working on Crimson for years to get past performance bottlenecks inherent to the HDD-based design. I’m having trouble finding any Ceph benchmark results that show anything close to 100 GB/s.
The comparison is a little pears to apples: similar nutrition, but different enough that you can't draw conclusions. The hardware in the Ceph test is only capable of 1.7 TiB/s of traffic at most (ideally, without any overhead whatsoever).
I also assume that the batch size (block size) is different enough that this alone would make a big difference.
That difference is still pronounced, yes. But the workload is so different. Training AI is hardly a random-read workload. It's still not a comparison that should lead you to any conclusions.
Seems like Ceph is considerably lower in throughput: https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/ That's a serious concern when saving hundreds of terabytes of weights and optimizer states every now and again, or loading large precomputed prefix KV-caches. MinIO seems to be slower still. IDK about SeaweedFS - they don't mention performance in their selling points at all.
It's quite funny that I got two opposite answers right away: you say it's to improve throughput, and sibling says it's to improve latency, and as we know throughput and latency trade off against each other. I'm inclined to agree it's more likely they're prioritizing throughput, since their readme charts throughput but not latency. But OTOH, the project looks like it requires RDMA. I wonder if the authors have written about their motivations and the tradeoffs they made, so we don't have to speculate.
Because the two are interconnected and aren't in conflict with each other. You not only want high throughput - that by itself would be quite limiting. You want it along with low latency as well, or else it's very easy to end up in the situation where your throughput is effectively zero if the access pattern is "bad".
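A rough way to see how the two interact: for a client that keeps a fixed number of requests in flight, Little's law gives throughput ≈ in-flight requests × request size / latency. The sketch below is only an illustration of that relationship; the function name and all the numbers are made-up assumptions, not measurements of any system discussed here.

```python
def sustained_throughput(in_flight: int, request_bytes: int, latency_s: float) -> float:
    """Bytes/s for a client that keeps `in_flight` requests outstanding (Little's law)."""
    return in_flight * request_bytes / latency_s


MiB = 1024 ** 2

# Hypothetical numbers, for illustration only.
# Large reads with plenty of parallelism: 4 MiB requests, 32 in flight, 2 ms each.
sequential = sustained_throughput(32, 4 * MiB, 2e-3)
# A "bad" access pattern: dependent 4 KiB reads, one at a time, same 2 ms latency.
dependent = sustained_throughput(1, 4096, 2e-3)

print(f"sequential: {sequential / MiB:,.0f} MiB/s")
print(f"dependent:  {dependent / MiB:,.2f} MiB/s")
```

With the same per-request latency, the dependent small-read pattern ends up several orders of magnitude slower, which is the "throughput is effectively zero" situation: when the access pattern can't be batched or parallelized, the only lever left is latency.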
The only competitors in the parallel FS space that are useful for this are Lustre and Weka.
Otherwise, if you don't need a single namespace, a bunch of fat AF NFSv4 servers w/ NFS over RDMA will also get you to 6 TiB/s.
The "surefire" way, though, is still Lustre. It's still the big daddy of distributed parallel filesystems, but it's an absolute beast to set up and operate.
It’s not. When you are a high frequency trader and you’ve mastered RDMA, everything around you looks slow. You are thinking in terms of 20-nanosecond intervals, while everyone around you still thinks that serving a query in under a millisecond is fast.
Huh? What kind of RDMA has a completion latency of 20 nanoseconds? It's more like 5 microseconds.
I agree that a lot of the "modern" storage stack is way too slow, though. Last year I tried to find a replication-first object store for crazy-fast random reads over a small number of objects and found none.
Completion latency is one thing, bandwidth would be another. There's apparently a whole world of Alveo SmartNICs and related FPGA platforms, and they can totally get into the nanosecond range for whatever nails may fit the compute-in-network hammer, even if bound by the latency of the consuming system / RDMA interface. Also: https://github.com/corundum/corundum is really popular with the Chinese!
I was talking about thinking in terms of 20-nanosecond intervals, rather than completing a request in 20 nanoseconds. To get 1 microsecond wire-to-wire latency you do need to count your nanoseconds.
Why this number? Because it’s roughly the time it takes to read 64 bytes from L3 cache. And NICs tend to be able to push data into L3 (or equivalents).
Current state of the art - look up nanoPU, from Stanford. Wire-to-wire under 100ns is not impossible, but this would normally assume a pre-cooked packet, selected from a number of packets (which is not an unusual scenario in HFT).
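To put the figures from this exchange side by side, here is a back-of-the-envelope sketch that expresses each latency as a number of 20 ns "ticks" (one 64-byte L3 read, per the estimate above). The only inputs are numbers quoted in the thread; the names are hypothetical and the result is rough arithmetic, not a measurement.

```python
L3_READ_NS = 20  # rough time to read 64 bytes from L3, as quoted in the thread


def ticks(budget_ns: float) -> float:
    """How many 20 ns L3-read-sized intervals fit into a latency budget."""
    return budget_ns / L3_READ_NS


# Latency figures mentioned in the thread, in nanoseconds.
budgets_ns = {
    "nanoPU-style wire-to-wire (<100 ns)": 100,
    "1 us wire-to-wire": 1_000,
    "RDMA completion (~5 us)": 5_000,
    "sub-millisecond query": 1_000_000,
}

for name, ns in budgets_ns.items():
    print(f"{name}: {ticks(ns):,.0f} ticks")
```

Seen this way, a 1 µs wire-to-wire budget leaves room for only a few dozen cache-line-sized steps, which is why counting individual nanoseconds matters there and not for a millisecond-scale query.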
Ah, makes sense. Sadly RDMA isn't that fast for now, or at least commercial RNICs/switches aren't :( Once you leave your host in a data center network, everything is counted in microseconds.
Software tech in China is a different landscape. It's really common to reinvent the wheel in China. Almost every big name (ByteDance, Meituan, etc.) has its own version of everything, for reasons of both office politics and in-house needs.
The thing is, this is so prevalent that the in-house tech has reached the point where it is competitive. This goes double for a quant firm like DeepSeek.