| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by blorgle 3507 days ago

> First off, the replication thing. It is true that ES, C*, and Mongo replicate within their cluster mostly automatically. However, this is not without cost. It takes non-trivial amounts of network capacity, disk I/O, and CPU cycles to migrate shards from a failed (or downed) node to a newly stood-up node. Often, many GBs must be moved and for something like ES, where shard replicas reside on many different nodes, that means much of your cluster feels the impact of this. The cluster can heal, but healing isn't easy.

I'm talking about replication, not sharding though. If the data is actually lost then you have to bear the penalty of re-replicating it to match your replica count regardless, there's no magic wand here to do with "persistent storage". If the data isn't actually lost (e.g. due to CoreOS automagic reboots) then you absolutely should be putting the cluster into maintenance mode until the reboots are complete.

> As for Cinder, it's reliant on OpenStack APIs which (at least as of Juno) are reliant on things like RabbitMQ. We've seen a number of OpenStack failures due to RabbitMQ partitioning and split-brained scenarios.

Still pretty confused when you mention OpenStack. Cinder doesn't rely on OpenStack APIs per se, it provides an OpenStack API (for block storage). RabbitMQ clustering has longstanding issues with partitions which are mentioned explicitly in the documentation, nothing to do with OpenStack, everything to do with Erlang MNESIA DB. Any decent OpenStack team has learned by now to use singleton RabbitMQs with a master/slave configuration loadbalancer (i.e. haproxy) in front.

> We're also back to the disk-on-network problem again: SCSI backplane ---ethernet---> client will never be as fast as local disk.

Right. But wasn't the comment about persistent storage? You're never going to have persistent storage in your k8s cluster that magically avoids that problem, so not really sure what the point is here.