by pat2man 481 days ago
Seems easy to find: https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/

3FS: 180 nodes, 2x200Gbps InfiniBand and 16x 14TiB NVMe SSDs per node, ~500 clients, 6.6 TiB/s of read throughput under a training-job workload

Ceph: 68 nodes, 2x100Gbps Mellanox and 10x 14TiB NVMe SSDs per node, 504 clients, 1 TiB/s on an FIO random-read workload

The comparison is a bit apples to pears: similar nutrition, but different enough that you can't draw conclusions. The network in the Ceph test can carry at most about 1.7TiB/s (68 nodes x 2x100Gbps, and that is with no overhead whatsoever).

I also assume that the batch size (block size) is different enough that this alone would make a big difference.

To account for the different hardware, we can normalize by comparing measured against theoretical throughput.

Ceph cluster achieves 1 TiB/s / 1.7 TiB/s ≈ 0.59, i.e. about 59% of theoretical throughput.

3FS cluster achieves 6.6 TiB/s / 9 TiB/s ≈ 0.73, i.e. about 73% of theoretical throughput.
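
For anyone who wants to redo the arithmetic, here is a rough Python sketch of the network-bound ceiling and the resulting efficiency. It only uses the node counts, link speeds and measured figures quoted above; the function names are made up for illustration, and it ignores the SSDs entirely. One caveat I'm adding: link speeds in Gbps are decimal, so the ceilings strictly come out in TB/s rather than TiB/s, and the sketch shows the ratio both ways.

```python
# Rough sketch: network-bound ceiling and efficiency for both clusters.
# Node counts, link speeds and measured throughput are the figures quoted
# above; the function names and the TB/TiB handling are my own assumptions.

TIB_PER_TB = 1e12 / 2**40  # one decimal terabyte expressed in TiB (~0.909)


def network_ceiling_tb_s(nodes: int, links_per_node: int, gbps_per_link: float) -> float:
    """Aggregate line rate in decimal TB/s, assuming zero protocol overhead."""
    total_gbps = nodes * links_per_node * gbps_per_link
    return total_gbps / 8 / 1000  # Gbit/s -> GB/s -> TB/s


def efficiency(measured: float, ceiling: float) -> float:
    """Fraction of the ceiling actually achieved (both in the same unit)."""
    return measured / ceiling


CLUSTERS = {
    # name: (nodes, links per node, Gbps per link, measured throughput as quoted)
    "Ceph": (68, 2, 100, 1.0),
    "3FS": (180, 2, 200, 6.6),
}


def summarize() -> dict:
    results = {}
    for name, (nodes, links, gbps, measured) in CLUSTERS.items():
        ceiling_tb = network_ceiling_tb_s(nodes, links, gbps)
        results[name] = {
            "ceiling_tb_s": ceiling_tb,
            # Same arithmetic as in the comment: units treated as interchangeable.
            "efficiency_as_quoted": efficiency(measured, ceiling_tb),
            # Stricter reading: measured value is TiB/s, ceiling converted to TiB/s.
            "efficiency_tib": efficiency(measured, ceiling_tb * TIB_PER_TB),
        }
    return results


if __name__ == "__main__":
    for name, row in summarize().items():
        print(f"{name}: ceiling {row['ceiling_tb_s']:.1f} TB/s, "
              f"{row['efficiency_as_quoted']:.0%} as quoted, "
              f"{row['efficiency_tib']:.0%} with TiB conversion")
```

With the units treated as interchangeable this gives the 59% and 73% above. With the ceiling converted to TiB/s the ratios come out at roughly 65% and 81%, so the gap between the two clusters stays about the same either way.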

The difference is still pronounced, yes, but the workloads are very different: AI training is hardly random read. This is still not a comparison you should draw any conclusions from.