Hacker News
by blorgle 3463 days ago
This doesn't really make much sense to me.

If your systems support software level replication (Elasticsearch, Cassandra, MySQL, MongoDB all do) then why do you need persistent storage? You just need container scheduling anti-affinity and enough replicas.
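To make "anti-affinity and enough replicas" concrete, here is a rough sketch of a pod spec fragment built as a Python dict. The field names follow the Kubernetes pod anti-affinity API as I understand it and may differ on older cluster versions. The `app` label, container name, and image are hypothetical placeholders.

```python
def anti_affinity_pod_spec(app_label, image):
    """Return a pod spec fragment that keeps replicas of one app on separate nodes.

    Assumption: the cluster supports spec.affinity.podAntiAffinity and nodes
    carry the standard kubernetes.io/hostname label.
    """
    return {
        "affinity": {
            "podAntiAffinity": {
                # Hard rule: never co-locate two pods that share this label.
                "requiredDuringSchedulingIgnoredDuringExecution": [
                    {
                        "labelSelector": {"matchLabels": {"app": app_label}},
                        # One pod per distinct hostname, i.e. per node.
                        "topologyKey": "kubernetes.io/hostname",
                    }
                ]
            }
        },
        "containers": [{"name": app_label, "image": image}],
    }


# Hypothetical app label and image, for illustration only.
spec = anti_affinity_pod_spec("es-data", "example/elasticsearch:latest")
```

With a rule like this, losing one node takes out at most one replica of the datastore, which is the property the replication argument depends on.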

You only need persistent storage for systems which don't support that replication. Ceph can certainly be deployed so that it performs well for DB workloads.

You say "Cinder has the stench of OpenStack" but Cinder is just a Python-based webapp which provides an API to arbitrary storage backends (Ceph RBD, iSCSI, NetApp ONTAP, whatever). How can it be "better now"? It doesn't provide storage on its own. If your ops team was using the default "proof of concept" LVM backend then I could see how you might get a bad impression, but that just means your ops team doesn't know much about OpenStack.

Am I missing something obvious?

1 comments

Yes, you're missing a few things. Maybe not obvious, though.

First off, the replication thing. It is true that ES, C*, and Mongo replicate within their cluster mostly automatically. However, this is not without cost. It takes non-trivial amounts of network capacity, disk I/O, and CPU cycles to migrate shards from a failed (or downed) node to a newly stood-up node. Often, many GBs must be moved and for something like ES, where shard replicas reside on many different nodes, that means much of your cluster feels the impact of this. The cluster can heal, but healing isn't easy.
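For a rough sense of scale, here is a back-of-the-envelope sketch of how long it takes just to move the data over the wire. The numbers in the example (50 GB, a 1 Gbit/s link, half of it available for recovery traffic) are assumptions for illustration, not measurements, and the estimate ignores disk I/O and CPU entirely.

```python
def transfer_seconds(data_gb, link_gbps, usable_fraction=1.0):
    """Estimate the seconds needed to copy data_gb gigabytes over a link.

    data_gb: decimal gigabytes to move (1 GB = 8 gigabits).
    link_gbps: nominal link speed in gigabits per second.
    usable_fraction: share of the link that recovery traffic may use (0 to 1].
    """
    if link_gbps <= 0 or not 0 < usable_fraction <= 1:
        raise ValueError("link_gbps must be > 0 and usable_fraction in (0, 1]")
    return data_gb * 8 / (link_gbps * usable_fraction)


# Assumed figures: 50 GB of shards, 1 Gbit/s link, 50% of it for recovery.
seconds = transfer_seconds(50, 1, 0.5)  # 800.0 seconds, a bit over 13 minutes
```

That is per recovery event, and the same traffic competes with normal query and indexing load on every node involved.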

Why would a cluster node go down? It's not always hardware failure. CoreOS regularly self-updates and reboots itself without intervention. In a Kubernetes cluster, this is a non-event because pods are simply rescheduled elsewhere and the degradation is momentary. If we were talking about 300 GB of persistent data, though, that's a serious amount of data that will get reshuffled every time there is a node reboot, especially when you consider that an Elasticsearch cluster may span dozens of physical nodes and experience dozens of node reboots in the course of a normal day. Maybe we could hack something that would disable shard reallocation in ES (there's a setting for this) when scheduled reboots happen, but that's pretty hacky. Besides, ES is just one of a number of different datastores in use at my workplace.
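For reference, the ES setting in question is, as far as I know, `cluster.routing.allocation.enable`, set through the `_cluster/settings` endpoint. A minimal sketch of building that request follows. The base URL is a hypothetical placeholder, the choice of `persistent` over `transient` scope is an assumption, and the request is only constructed here, not sent.

```python
import json
import urllib.request

ALLOWED_MODES = {"all", "primaries", "new_primaries", "none"}


def allocation_request(base_url, mode):
    """Build (without sending) a PUT that sets cluster.routing.allocation.enable.

    Use "primaries" or "none" before a planned reboot and "all" afterwards.
    """
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unknown allocation mode: {mode!r}")
    body = json.dumps(
        {"persistent": {"cluster.routing.allocation.enable": mode}}
    ).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_cluster/settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical endpoint; pass the result to urllib.request.urlopen to send it.
before_reboot = allocation_request("http://localhost:9200", "primaries")
after_reboot = allocation_request("http://localhost:9200", "all")
```

Something still has to call this around every automatic reboot, which is the hacky part.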

As for Cinder, it's reliant on OpenStack APIs which (at least as of Juno) are reliant on things like RabbitMQ. We've seen a number of OpenStack failures due to RabbitMQ partitioning and split-brained scenarios. We're also back to the disk-on-network problem again: SCSI backplane ---ethernet---> client will never be as fast as local disk.

> First off, the replication thing. It is true that ES, C*, and Mongo replicate within their cluster mostly automatically. However, this is not without cost. It takes non-trivial amounts of network capacity, disk I/O, and CPU cycles to migrate shards from a failed (or downed) node to a newly stood-up node. Often, many GBs must be moved and for something like ES, where shard replicas reside on many different nodes, that means much of your cluster feels the impact of this. The cluster can heal, but healing isn't easy.

I'm talking about replication, not sharding, though. If the data is actually lost then you have to bear the penalty of re-replicating it to match your replica count regardless; "persistent storage" is no magic wand here. If the data isn't actually lost (e.g. due to CoreOS automagic reboots) then you absolutely should be putting the cluster into maintenance mode until the reboots are complete.

> As for Cinder, it's reliant on OpenStack APIs which (at least as of Juno) are reliant on things like RabbitMQ. We've seen a number of OpenStack failures due to RabbitMQ partitioning and split-brained scenarios.

Still pretty confused when you mention OpenStack. Cinder doesn't rely on OpenStack APIs per se; it provides an OpenStack API (for block storage). RabbitMQ clustering has longstanding issues with partitions, which are mentioned explicitly in the documentation. That has nothing to do with OpenStack and everything to do with Erlang's Mnesia DB. Any decent OpenStack team has learned by now to use singleton RabbitMQs behind a load balancer (e.g. HAProxy) in a master/slave configuration.

> We're also back to the disk-on-network problem again: SCSI backplane ---ethernet---> client will never be as fast as local disk.

Right. But wasn't the comment about persistent storage? You're never going to have persistent storage in your k8s cluster that magically avoids that problem, so not really sure what the point is here.