by one_buggy_boi 878 days ago
Is modern Ceph appropriate for transactional database storage, and how is the I/O latency? I'd like to move to a cheaper clustered file system that can compete with systems like Oracle's clustered file system or DBs backed by something like Veritas. Veritas supports multi-petabyte DBs, and I haven't seen much outside of it or OCFS that scales similarly with acceptable latency.

Not sure about putting DBs on CephFS directly, but Ceph RBD can definitely run RDBMS workloads.

You need to pay attention to the kind of hardware you use, but you can definitely get Ceph down to 0.5-0.6 ms latency on block workloads doing single-threaded, queue-depth-1, synchronous 4K writes.
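For anyone who wants a rough feel for that kind of number on their own mount, here is a minimal sketch of a single-threaded, one-write-at-a-time, synchronous 4K write probe. It is not how the figures above were measured; the function name `sync_write_latencies` and its parameters are made up for illustration, it assumes a Unix-like OS, and it goes through the page cache with O_DSYNC rather than O_DIRECT, so treat results as approximate.

```python
import os
import time


def sync_write_latencies(path, count=100, block_size=4096):
    """Issue `count` sequential writes of `block_size` bytes, one at a time.

    Each write is opened with O_DSYNC (falling back to O_SYNC), so the call
    returns only after the data has been reported durable. Returns the
    per-write latencies in milliseconds. Hypothetical helper, not part of
    any Ceph tooling.
    """
    sync_flag = getattr(os, "O_DSYNC", os.O_SYNC)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | sync_flag, 0o600)
    buf = os.urandom(block_size)
    latencies_ms = []
    try:
        for i in range(count):
            start = time.perf_counter()
            os.pwrite(fd, buf, i * block_size)  # one outstanding write at a time
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
    return latencies_ms
```

Point `path` at a file on the filesystem backed by the storage you care about (for example a filesystem on an RBD image) and look at the distribution, not just the mean.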

Source: I run Ceph at work doing pretty much this.

It is important to specify which latency percentile this is. Checking on a customer's cluster (made from 336 SATA SSDs in 15 servers, so not the best one in the world):

  50th percentile = 1.75 ms
  90th percentile = 3.15 ms
  99th percentile = 9.54 ms
That's with 700 MB/s of reads and 200 MB/s of writes, or approximately 7000 read IOPS and 9000 write IOPS.
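As a side note on reading figures like these, here is a small sketch of two calculations: a nearest-rank percentile over raw latency samples, and the average request size implied by a throughput and IOPS pair. Both helper names are hypothetical, MB is assumed to mean 10^6 bytes, and nothing here says how the cluster's own percentiles were actually collected.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def mean_request_kb(throughput_mb_s, iops):
    """Average request size in kB implied by throughput (MB/s) and IOPS,
    assuming decimal units (1 MB = 1000 kB)."""
    return throughput_mb_s * 1000.0 / iops


# Applied to the figures quoted above:
read_kb = mean_request_kb(700, 7000)    # 100 kB per read on average
write_kb = mean_request_kb(200, 9000)   # roughly 22 kB per write on average
```

So the quoted percentiles come from a mix of fairly large requests, which is worth keeping in mind when comparing them with 4K sync-write numbers.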
These numbers may be good enough for your use case, but compared with what’s possible with SSDs they aren’t great. Please note, I mean well. It’s still a cool setup.

I’d like to see much more consistent latency, with even the 99th percentile below 1 ms. You might want to set a latency target with fio and see how much load the cluster can sustain before the 99th percentile hits 1 ms.
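A rough sketch of what such a fio run could look like, written as a Python helper that only assembles the command line. The helper name `fio_latency_target_cmd`, the job name, and all default values are illustrative assumptions; `latency_target`, `latency_window`, and `latency_percentile` are fio options, given here in microseconds. With these set, fio adjusts the queue depth up to `iodepth` while trying to keep the chosen percentile under the target.

```python
import shlex


def fio_latency_target_cmd(target, p99_target_us=1000, window_us=5_000_000,
                           max_iodepth=64, runtime_s=60):
    """Build (but do not run) a fio command that searches for the highest
    load at which the 99th percentile latency stays under the target.

    `target` is a placeholder for a test file or device path. Random writes
    to a raw device destroy its contents, so use a scratch target.
    """
    return [
        "fio",
        "--name=latency-target-probe",
        f"--filename={target}",
        "--ioengine=libaio",
        "--direct=1",
        "--rw=randwrite",
        "--bs=4k",
        f"--iodepth={max_iodepth}",      # upper bound fio may scale up to
        "--time_based",
        f"--runtime={runtime_s}",
        f"--latency_target={p99_target_us}",   # microseconds
        f"--latency_window={window_us}",       # microseconds
        "--latency_percentile=99",
    ]


# Print a copy-pasteable command for a hypothetical scratch file.
print(shlex.join(fio_latency_target_cmd("/path/to/scratch/file")))
```

Check the option semantics against the documentation for your fio version before relying on the result.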

That said, it’s all about context, and depending on the workload your figures may be totally fine.

Latency is quite poor; I wouldn't recommend running high-performance database loads there.
From my dated experience, Ceph is absolutely amazing but latency is indeed a relative weak spot.

Everything has a trade-off: with Ceph you get a ton of capability, and latency is the price. Databases, depending on requirements, may be better off on regular NVMe and not on Ceph.

It's pretty unfair to compare latency of a local NVMe SSD to over-the-network 3x replicated storage. "It's faster if I do less."

[Disclaimer: ex-Inktank employee]

No, it's important when planning, e.g. one big database cluster that provides DB-as-a-service (but maybe needs some dedicated ops resources) vs. smaller DBs with virtualized storage on Ceph (ops resources for the Ceph cluster and VM tools like k8s).

If the latter is too slow for your typical usage...

Oh, don't get me wrong, you will pay a price for disaggregated highly available storage, and you might need to evaluate whether you want to pay that price or not. But those are two very different worlds, and only one of them gives you elastic disk size, replication, scale-out throughput, and so on.

GP makes Ceph sound worse than it is, when the reality is that just shoving all your reads & writes over the network, with writes sent multiple times because of replication, is gonna cost you no matter what tech you build that with.
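To make the "it costs you no matter what" point concrete, here is a deliberately simplified model of a replicated write, assuming the primary forwards to the replicas in parallel and acknowledges the client only after every copy is durable. It ignores software overhead, queueing, and serialization, and the function name and every number in it are hypothetical, not measurements.

```python
def replicated_write_latency_ms(client_rtt_ms, primary_write_ms, replica_paths):
    """Estimate client-visible latency of one replicated write.

    replica_paths: list of (primary_to_replica_rtt_ms, replica_write_ms).
    The primary's own write and the replica round trips overlap, so the
    slowest of them determines when the client can be acknowledged.
    """
    slowest_replica = max((rtt + write for rtt, write in replica_paths),
                          default=0.0)
    return client_rtt_ms + max(primary_write_ms, slowest_replica)


# Hypothetical figures: 0.1 ms network round trips, 0.05-0.08 ms device writes.
local_ms = 0.05  # a local device pays only its own write time
networked_ms = replicated_write_latency_ms(
    client_rtt_ms=0.1,
    primary_write_ms=0.05,
    replica_paths=[(0.1, 0.05), (0.1, 0.08)],
)
```

Even in this idealized version the networked path pays two network round trips plus the slowest device, so it cannot match the local write regardless of which system implements it.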

I don’t think it’s unfair. There are applications that are still OK with Ceph latencies; I bet it’s good enough for a ton of things.

But not all things.