by __turbobrew__
442 days ago
I already set things up with Rook as we are super heavily invested in Kubernetes, and things are working well so far. I built out a test cluster to 1PiB and was able to push more than a terabit/second through the cluster, which was good. I also set up topology aware replication so pg’s can be spread across racks/datacenters. My main worry now is disaster recovery. From what I have seen, object recovery is quite manual if you lose any. I would like to write some scripts so we can bulk mark objects which we know are actually lost. We already have a loki setup, so Ceph logs just get put into there.
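For what it's worth, a script like that could start as a dry-run wrapper around `ceph pg <pgid> mark_unfound_lost revert|delete`. The following is only a rough sketch: it assumes `ceph health detail` prints lines shaped like `pg 2.4 is active+degraded, 78 unfound` (the exact wording may differ between releases), and the function names `parse_unfound_pgs` and `build_mark_lost_commands` are made up for illustration. It only builds and prints the commands; nothing is executed.

```python
import re
import shlex
from typing import Dict, List

# Assumed line shape in `ceph health detail` output, e.g.
#   "pg 2.4 is active+degraded, 78 unfound"
# Newer releases may word this differently, so treat the pattern as a guess.
UNFOUND_LINE = re.compile(r"^\s*pg (\d+\.[0-9a-f]+) is .*?\b(\d+) unfound")


def parse_unfound_pgs(health_detail: str) -> Dict[str, int]:
    """Return a mapping of PG id -> number of unfound objects."""
    pgs: Dict[str, int] = {}
    for line in health_detail.splitlines():
        match = UNFOUND_LINE.match(line)
        if match:
            pgs[match.group(1)] = int(match.group(2))
    return pgs


def build_mark_lost_commands(pgs: Dict[str, int], action: str = "revert") -> List[List[str]]:
    """Build one `ceph pg <pgid> mark_unfound_lost <action>` argv per PG."""
    if action not in ("revert", "delete"):
        raise ValueError("action must be 'revert' or 'delete'")
    return [["ceph", "pg", pgid, "mark_unfound_lost", action] for pgid in sorted(pgs)]


if __name__ == "__main__":
    sample = (
        "HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%)\n"
        "pg 2.4 is active+degraded, 78 unfound\n"
    )
    # Dry run: print the commands for review instead of running them.
    for argv in build_mark_lost_commands(parse_unfound_pgs(sample)):
        print(shlex.join(argv))
```

Keeping it as a dry run by default seems prudent, since marking objects lost is irreversible; a human can review the printed list against what is known to be gone before anything is piped to a shell.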
When I read this I think "but you should never lose an object". Do you mean like the underlying data chunks Ceph stores? Can you elaborate on this part? I know some of the teams I work with do things in unorthodox ways and we tend to operate on different assumptions than others.
> so pg’s can be spread across racks/datacenters.
Some Ceph pools come to mind (this was a while ago, though I'm sure they're still running) where the erasure coding was done across cabinet rows and each cabinet row was on its own power distribution. I don't know how the power worked, but I was told rather bluntly that some specific Ceph pools' failure domains aligned with the datacenter's failure domains.
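To make the row-level reasoning concrete, here is a toy calculation. It assumes a k+m erasure-coded pool that places exactly one chunk per cabinet row, so losing a row (say, its power feed) costs one chunk; the numbers and the helper names are illustrative, not the layout of the pools I mentioned.

```python
def can_still_read(k: int, m: int, rows_lost: int) -> bool:
    """With one chunk per row, data is readable while at least k chunks survive."""
    surviving = (k + m) - rows_lost
    return surviving >= k


def storage_overhead(k: int, m: int) -> float:
    """Raw bytes stored per byte of user data for a k+m erasure code."""
    return (k + m) / k


if __name__ == "__main__":
    k, m = 4, 2  # hypothetical profile: 4 data chunks, 2 coding chunks, 6 rows
    for lost in range(4):
        print(f"rows lost={lost}: readable={can_still_read(k, m, lost)}")
    print(f"overhead={storage_overhead(k, m)}x")
```

The point is just that such a pool tolerates up to m simultaneous row outages, which is why lining the CRUSH failure domain up with the power domain matters.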
> We already have a loki setup
Nice. We have logs go into S3, and then anyone who prefers a particular tool is welcome to load whatever sets of logs they want from S3, within the resource limits set for whatever K8s namespace they work with. Originally, keeping logs append-only in S3 was for compliance, but we also wanted to limit team members by RAM quota rather than by tool, in line with the "people over tools over process" DevOps maxim.