| I wanted to see how 1 TiB/s compares to the actual theoretical limits of the hardware. So here is what I found: The cluster has 68 nodes, each a Dell PowerEdge R6615 (https://www.delltechnologies.com/asset/en-us/products/server...). The R6615 configuration they run is the one with 10 U.2 drive bays. The U.2 link carries data over 4 PCIe gen4 lanes. Each PCIe gen4 lane runs at 16 GT/s, which is roughly 16 Gbit/s. The lanes have a negligible ~1.5% overhead thanks to 128b/130b encoding. This means each U.2 link has a maximum link bandwidth of 16 * 4 = 64 Gbit/s or 8 Gbyte/s. However the U.2 NVMe drives they use are Dell 15.36TB Enterprise NVMe Read Intensive AG, which appear to be capable of 7 Gbyte/s read throughput (https://www.serversupply.com/SSD%20W-TRAY/NVMe/15.36TB/DELL/...). So they are not bottlenecked by the U.2 link (8 Gbyte/s). Each node has 10 U.2 drives, so each node can do local read I/O at a maximum of 10 * 7 = 70 Gbyte/s. However each node has a network bandwidth of only 200 Gbit/s (2 x 100GbE Mellanox ConnectX-6), which is only 25 Gbyte/s. This implies that remote reads are under-utilizing the drives (capable of 70 Gbyte/s). The network is the bottleneck. Assuming no additional network bottlenecks (they don't describe the network architecture), this implies the 68 nodes can provide 68 * 25 = 1700 Gbyte/s of network reads. The author benchmarked 1 TiB/s, or more precisely 1025 GiB/s = 1101 Gbyte/s, which is 65% of the maximum theoretical 1700 Gbyte/s. That's pretty decent, but in theory it's still possible to do a bit better, assuming all nodes can truly saturate their 200 Gbit/s network links concurrently. Reading this whole blog post, I got the impression Ceph's complexity hits the CPU pretty hard. That not compiling a module with -O2 ("Fix Three", linked by the author: https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1894453) can make performance "up to 5x slower with some workloads" (https://bugs.gentoo.org/733316) is pretty unexpected for a pure I/O workload.
Also, what's up with OSD threads wasting so much CPU grabbing the IOMMU spinlock? I agree with the conclusion that the OSD threading model is suboptimal. A relatively simple synthetic 100% read benchmark should not expose thread contention if that part of Ceph's software architecture were well designed (which is fixable, so I hope the Ceph devs prioritize this). |
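For anyone who wants to re-run the arithmetic in the quoted comment, here is a rough Python sketch. The constants are the figures quoted above (68 nodes, 10 drives per node, 7 Gbyte/s per drive, 2 x 100GbE, 1025 GiB/s measured); the function and key names are made up for illustration, and the model assumes, as the comment does, that nothing else in the network limits throughput.

```python
# Back-of-the-envelope check of the figures in the quoted comment.
# All names here are illustrative; the numbers come from the comment itself.
PCIE4_GT_PER_LANE = 16      # GT/s per PCIe gen4 lane
ENCODING = 128 / 130        # 128b/130b line encoding efficiency
LANES_PER_U2 = 4
DRIVE_READ_GBYTE_S = 7      # per-drive read throughput from the cited listing
DRIVES_PER_NODE = 10
NIC_GBIT_S = 200            # 2 x 100GbE per node
NODES = 68
MEASURED_GIB_S = 1025       # benchmark result reported in the blog post


def cluster_limits():
    # Usable U.2 link bandwidth in Gbyte/s after encoding overhead.
    u2_link = PCIE4_GT_PER_LANE * LANES_PER_U2 * ENCODING / 8
    # A drive cannot read faster than its link allows.
    node_drives = DRIVES_PER_NODE * min(DRIVE_READ_GBYTE_S, u2_link)
    node_net = NIC_GBIT_S / 8
    # Remote reads are capped by whichever is slower: drives or network.
    node_remote = min(node_drives, node_net)
    cluster = NODES * node_remote
    # Convert GiB/s (2**30 bytes) to Gbyte/s (10**9 bytes).
    measured = MEASURED_GIB_S * 2**30 / 1e9
    return {
        "u2_link_gbyte_s": u2_link,
        "node_drives_gbyte_s": node_drives,
        "node_net_gbyte_s": node_net,
        "cluster_gbyte_s": cluster,
        "measured_gbyte_s": measured,
        "efficiency": measured / cluster,
    }


if __name__ == "__main__":
    for key, value in cluster_limits().items():
        print(f"{key}: {value:.3f}")
```

With encoding overhead included, the U.2 link comes out slightly under 8 Gbyte/s, which is still above the 7 Gbyte/s the drives are listed at, so the conclusion that the network is the bottleneck does not change.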
I did some work last summer kind of duct taping the OSD's existing threading model (double buffering the hand-off between async msgr and worker threads, adaptive thread wakeup, etc). I could achieve significant performance / efficiency gains under load, but at the expense of increased low-load latency (Ceph by default is very aggressive about waking up threads when new IO arrives for a given shard).
One of the other core developers and I discussed it and we both came to the conclusion that it probably makes sense to do a more thorough rewrite of the threading code.
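For readers unfamiliar with the term, a double-buffered hand-off between a producer thread and a worker thread can look roughly like the Python sketch below. This is a generic illustration, not the Ceph OSD code or the patch described above: the class and method names are hypothetical, it assumes exactly one consumer thread, and the only wakeup policy it shows is the simple one of notifying when the buffer goes from empty to non-empty.

```python
import threading


class HandoffBuffer:
    """Hypothetical double-buffered hand-off for ONE consumer thread.

    Producers append to a front list under a lock. The consumer swaps the
    front and back lists under the same lock, then processes the back list
    without holding the lock, so producers are blocked only for the swap.
    """

    def __init__(self):
        self._cond = threading.Condition(threading.Lock())
        self._front = []   # filled by producers
        self._back = []    # drained by the single consumer

    def put(self, item):
        with self._cond:
            was_empty = not self._front
            self._front.append(item)
            if was_empty:
                # Wake the consumer only on the empty -> non-empty edge,
                # so a burst of items costs a single wakeup.
                self._cond.notify()

    def take_batch(self, timeout=None):
        with self._cond:
            if not self._front:
                self._cond.wait(timeout)
            # Swap buffers; the old back list is empty at this point.
            self._front, self._back = self._back, self._front
        # Outside the lock: copy out and reset the consumer-side buffer.
        batch = list(self._back)
        self._back.clear()
        return batch  # may be empty after a timeout or spurious wakeup


if __name__ == "__main__":
    buf = HandoffBuffer()
    for i in range(5):
        buf.put(i)
    print(buf.take_batch())
```

Batching like this trades latency for efficiency: items that arrive while the worker is busy are picked up together on the next swap instead of each triggering its own wakeup, which is the same direction as the trade-off described above (better behavior under load, worse low-load latency).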