by do_not_redeem 483 days ago
Can someone convince me this isn't NIH syndrome? Why would you use this instead of SeaweedFS, Ceph, or MinIO?

> The final aggregate read throughput reached approximately 6.6 TiB/s with background traffic from training jobs.

The Ceph team has been working on Crimson for years to get past performance bottlenecks inherent to the HDD-based design. I'm having trouble finding any Ceph benchmark results that show anything close to 100 GB/s.

3FS: 180 nodes, 2x200Gbps InfiniBand and 16x 14TiB NVMe SSDs per node, ~500 clients, 6.6 TiB/s of read throughput with training jobs workload

Ceph: 68 nodes, 2x100Gbps Mellanox and 10x 14TiB NVMe SSDs per node, 504 clients, 1TiB/s of FIO random read workload

The comparison is a little apples to pears: similar nutrition, but different enough that you can't draw conclusions. The hardware in the Ceph test is only capable of at most 1.7 TiB/s of traffic (optimally, without any overhead whatsoever).

I also assume that the batch size (block size) is different enough that this alone would make a big difference.

Even with different hardware, we can normalize by comparing measured against theoretical throughput.

Ceph cluster achieves 1 TiB/s / 1.7 TiB/s ≈ 0.59, i.e. about 59% of theoretical throughput.

3FS cluster achieves 6.6 TiB/s / 9 TiB/s ≈ 0.73, i.e. about 73% of theoretical throughput.
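
For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It assumes the NICs are the only bottleneck and uses the node and link counts quoted above; the function and variable names are made up for illustration. Note that 2x100 Gbps x 68 nodes comes out to 1.7 TB/s (decimal), not 1.7 TiB/s, so the sketch reports the ratio both as quoted and with consistent units.

```python
# Rough NIC-bound throughput ceiling vs. measured throughput.
# Assumption: network links are the only limit (no protocol overhead).
BYTES_PER_GBIT = 1e9 / 8   # bytes per second carried by 1 Gbps
TB = 1e12                  # decimal terabyte
TIB = 2 ** 40              # binary tebibyte


def nic_limit(nodes, links_per_node, gbps_per_link):
    """Aggregate link bandwidth of the storage nodes, in bytes/s."""
    return nodes * links_per_node * gbps_per_link * BYTES_PER_GBIT


def summarize(nodes, links_per_node, gbps_per_link, measured_tib_s):
    limit = nic_limit(nodes, links_per_node, gbps_per_link)
    return {
        "limit_tb_s": limit / TB,
        "limit_tib_s": limit / TIB,
        # Ratio as computed in the comment (TiB/s divided by TB/s).
        "ratio_as_quoted": measured_tib_s / (limit / TB),
        # Ratio with both sides in TiB/s.
        "ratio_consistent": measured_tib_s / (limit / TIB),
    }


# Figures taken from the two setups described in this thread.
clusters = {
    "Ceph": dict(nodes=68, links_per_node=2, gbps_per_link=100, measured_tib_s=1.0),
    "3FS": dict(nodes=180, links_per_node=2, gbps_per_link=200, measured_tib_s=6.6),
}

results = {name: summarize(**cfg) for name, cfg in clusters.items()}

for name, r in results.items():
    print(
        f"{name}: limit {r['limit_tb_s']:.1f} TB/s ({r['limit_tib_s']:.2f} TiB/s), "
        f"quoted ratio {r['ratio_as_quoted']:.2f}, "
        f"consistent-unit ratio {r['ratio_consistent']:.2f}"
    )
```

With consistent units the ratios come out a bit higher (roughly 0.65 for Ceph and 0.81 for 3FS), but the gap between the two stays about the same.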

That difference is still pronounced, yes. But the workload is so different. Training AI is hardly random read. Still not a comparison which should lead you to any conclusions.
I'd argue that they don't need a filesystem or an object storage, they need a purpose-built data serving layer optimized for their usecase.
Seems like Ceph is considerably lower in throughput: https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/ That is a serious concern when saving hundreds of terabytes of weights and optimizer states every now and again, or loading large precomputed prefix KV-caches. MinIO seems to be slower still. IDK about SeaweedFS - they don't mention performance in their selling points at all.
Look at the hardware first:

The hardware in the Ceph test is only capable of max 1.7TiB/s traffic (optimally without any overhead whatsoever).

I also assume that the batch size (block size) is different enough that this alone would make a big difference.

It's quite funny that I got two opposite answers right away: you say it's to improve throughput, and sibling says it's to improve latency, and as we know throughput and latency trade off against each other. I'm inclined to agree it's more likely they're prioritizing throughput, since their readme charts throughput but not latency. But OTOH, the project looks like it requires RDMA. I wonder if the authors have written about their motivations and the tradeoffs they made, so we don't have to speculate.

EDIT: Their blog post answered all my questions and more. https://www.high-flyer.cn/blog/3fs/

Because the two are interconnected and aren't in conflict with each other. You not only want high throughput - that by itself would be quite limiting. You want it along with low latency as well, or else it's very easy to end up in the situation where your throughput is effectively zero if the access pattern is "bad".
None of those are close to fast enough.

The only competitors in the parallel FS space that are useful for this are Lustre and Weka.

Otherwise if you don't need a single namespace a bunch of fat AF NFSv4 servers w/NFS over RDMA will also get you to 6TiB/s.

The "surefire" way is still Lustre, though. It's still the big daddy of distributed parallel filesystems, but it's an absolute beast to set up and operate.

It's not. When you are a high-frequency trader and you've mastered RDMA, everything around you looks slow. You are thinking in terms of 20-nanosecond intervals, while everyone around you still thinks that serving a query in under a millisecond is fast.
Huh? What kind of RDMA has a completion latency of 20 nanoseconds? It's more like 5 microseconds.

I agree that a lot of the "modern" storage stack is way too slow, though. Last year I tried to find a replication-first object store for crazy-fast random reads over a small number of objects and found none.

Completion latency is one thing, bandwidth is another. There's apparently a whole world of Alveo SmartNICs and related FPGA platforms, and they can totally get into the nanosecond range for whatever nails fit the compute-in-network hammer, even if bound by the latency of the consuming system / RDMA interface. Also: https://github.com/corundum/corundum is really popular with the Chinese!
I was talking about thinking in terms of 20-nanosecond intervals, not about completing a request in 20 nanoseconds. To get 1 microsecond of wire-to-wire latency you do need to count your nanoseconds.

Why this number? Because it's roughly the time it takes to read 64 bytes from L3 cache, and NICs tend to be able to push data into L3 (or equivalents).

For the current state of the art, look up nanoPU, from Stanford. Wire-to-wire under 100 ns is not impossible, but this would normally assume a pre-cooked packet, selected from a number of packets (which is not an unusual scenario in HFT).

Ah, makes sense. Sadly RDMA isn't that fast for now, or at least commercial RNICs/switches aren't :( Once you leave your host and go out onto the data center network, everything is counted in microseconds.
Software tech in China is a different landscape. It's really common to reinvent the wheel in China. Almost every big name (Bytedance, Meituan, etc.) has its own version of everything, for reasons of both office politics and in-house needs.

The thing is, this is so prevalent that the in-house tech has reached the point where it is competitive. This goes double for a quant firm like DeepSeek.

If NIH syndrome boosts the team's morale, it should still help the team's overall progress, though.
Sometimes you must succumb to NIH. How do you think all those tools you mentioned got produced?