by ex3ndr 3567 days ago
We (actor.im) also moved from Google Cloud to our own servers + k8s. Shared persistent storage is a huge pain. We eventually stopped trying to do this; we will try again when PetSets are in beta and are able to update their images.

We tried:

* glusterfs - a cluster can be set up in seconds, really. Just launch daemon sets and manually (but you can automate this) create a cluster. But we hit the fact that CoreOS can't mount glusterfs shares at all. We tried to mount NFS instead and then hit the next problem.

* NFS from k8s does not work at all, mostly because the kubelet (the k8s agent) needs to run directly on the machine and not via rkt/docker. Instead of updating all our nodes, we mounted the NFS share directly on our nodes.

* PostgreSQL - we haven't tried it yet, but if an occasional pod kill takes place, resyncing the database can become a huge issue. We ended up running pods that are dedicated to a specific node and doing manual master-slave configuration. We haven't tried other solutions yet, but they are also questionable in a k8s cluster.

* RabbitMQ - the biggest nightmare of them all. It needs good DNS names for each node, and here we have huge problems on the k8s side: we don't have static host names at all. The documentation says it is possible, but it isn't. You can open the kube-dns code: there is no code for it at all. For pods we only have a domain name that is IP-like: "10-0-0-10". We ended up not clustering RabbitMQ at all. This is not a very important dataset for us, so we can afford to lose it.

* Consul - while working around the problems with RabbitMQ in k8s and fighting DNS, we found that the Consul DNS API works much better than the built-in kube-dns. So we installed it, and our cluster just went down when we killed some Consul pods, because their host names and IPs changed. And there is no straightforward way to fix IPs or host names (host names don't work at all, only the IP-like ones, which can easily change on pod deletion).
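To illustrate the IP-like names mentioned in the RabbitMQ and Consul items: a minimal Python sketch, assuming an IPv4 pod, the default `cluster.local` cluster domain, and the conventional `<dashed-ip>.<namespace>.pod.<cluster-domain>` layout. The helper name `pod_dns_name` and the second IP address are made up for illustration. Because the name is derived from the pod IP, a pod that is deleted and rescheduled comes back under a different name.

```python
import ipaddress


def pod_dns_name(pod_ip: str, namespace: str = "default",
                 cluster_domain: str = "cluster.local") -> str:
    """Build the IP-derived DNS name of a pod.

    Assumption: IPv4 address and the conventional
    <dashed-ip>.<namespace>.pod.<cluster-domain> layout.
    """
    # Validate the address; a malformed string raises ValueError.
    ip = ipaddress.IPv4Address(pod_ip)
    dashed = str(ip).replace(".", "-")
    return f"{dashed}.{namespace}.pod.{cluster_domain}"


if __name__ == "__main__":
    before = pod_dns_name("10.0.0.10")
    # Same workload after the pod was killed and rescheduled (made-up IP).
    after = pod_dns_name("10.0.3.27")
    print(before)
    print(after)
```

Anything that stores the old name as a cluster peer, as RabbitMQ and Consul do, loses track of the node after such a restart.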

So the best way is to have some fast(!) external storage and mount it over the network into your pods. This is much, much slower than direct access to the node's SSD, but it gives you flexibility.
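The trade-off between node-local and network storage can be sketched as two pod manifests built in Python. This is only an illustration under assumptions: the image name, paths, node name, and NFS server are placeholders, and the helper names are hypothetical. The node-local variant uses a `hostPath` volume plus a `nodeSelector` on the `kubernetes.io/hostname` label, so the pod can only run on one machine; the network variant uses an `nfs` volume and can be scheduled anywhere.

```python
import json
from typing import Optional


def pod_manifest(name: str, volume: dict, node: Optional[str] = None) -> dict:
    """Build a single-container Pod manifest with one data volume."""
    spec = {
        "containers": [{
            "name": name,
            "image": "example/db:latest",  # placeholder image
            "volumeMounts": [{"name": "data", "mountPath": "/data"}],
        }],
        # Add the volume name to the caller-supplied volume source.
        "volumes": [dict(volume, name="data")],
    }
    if node is not None:
        # Pin the pod to one node; it cannot be rescheduled elsewhere.
        spec["nodeSelector"] = {"kubernetes.io/hostname": node}
    return {"apiVersion": "v1", "kind": "Pod",
            "metadata": {"name": name}, "spec": spec}


def node_local_pod(name: str, node: str, host_path: str) -> dict:
    """Fast variant: data lives on the node's own disk."""
    return pod_manifest(name, {"hostPath": {"path": host_path}}, node=node)


def network_storage_pod(name: str, server: str, export_path: str) -> dict:
    """Flexible variant: data lives on an external NFS server."""
    return pod_manifest(name, {"nfs": {"server": server, "path": export_path}})


if __name__ == "__main__":
    print(json.dumps(node_local_pod("pg", "node-1", "/mnt/ssd/pg"), indent=2))
    print(json.dumps(
        network_storage_pod("pg", "storage.example.internal", "/exports/pg"),
        indent=2))
```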

2 comments

As long as you associate a separate service with each RabbitMQ pod, you can make it work without petsets. (Setting the hostname inside the pod is trivial, just make sure it matches.) Then you can create a "headless" service for clients to connect to, which matches against all the pods.
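Roughly, the layout described above could be generated like this. It is a sketch, not a tested deployment: the `app`/`instance` labels and the `rabbitmq-N` naming are assumptions chosen for illustration, and the ports are the usual RabbitMQ defaults (4369 for epmd, 5672 for AMQP, 25672 for inter-node traffic). Each pod gets its own Service whose name should match the hostname set inside the pod, plus one headless Service (`clusterIP: None`) that matches all pods for clients.

```python
import json


def per_pod_service(index: int, app: str = "rabbitmq") -> dict:
    """Service that selects exactly one RabbitMQ pod."""
    name = f"{app}-{index}"
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            # The pod must carry both labels and use `name` as its hostname.
            "selector": {"app": app, "instance": name},
            "ports": [
                {"name": "epmd", "port": 4369},
                {"name": "amqp", "port": 5672},
                {"name": "dist", "port": 25672},
            ],
        },
    }


def headless_client_service(app: str = "rabbitmq") -> dict:
    """Headless Service that matches every RabbitMQ pod, for clients."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": app},
        "spec": {
            "clusterIP": "None",  # headless: DNS returns the pod IPs directly
            "selector": {"app": app},
            "ports": [{"name": "amqp", "port": 5672}],
        },
    }


def manifests(replicas: int = 3) -> list:
    """One Service per pod, followed by the shared headless Service."""
    services = [per_pod_service(i) for i in range(replicas)]
    return services + [headless_client_service()]


if __name__ == "__main__":
    bundle = {"apiVersion": "v1", "kind": "List", "items": manifests()}
    print(json.dumps(bundle, indent=2))
```

The printed JSON can be saved to a file and reviewed before applying it to a cluster.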

If you set it up in HA mode, then in theory you don't need persistent volumes, although RabbitMQ is of course flaky for other reasons unrelated to Kubernetes -- I wouldn't run it if I didn't have existing apps that rely on it.

> RabbitMQ is of course flaky for other reasons unrelated to Kubernetes -- I wouldn't run it if I didn't have existing apps that rely on it.

I'm surprised because I know teams which are very satisfied with running RabbitMQ at scale. Could you elaborate?

RabbitMQ doesn't have a good clustering story. The clustering was added after the fact, and it shows. I've written about it on HN several times before, e.g. [1]. Also see Aphyr's Jepsen test of RabbitMQ [2], which demonstrates the problem a bit more rigorously.

With HA mode enabled, it will behave decently during a network partition (which can be caused by non-network-related things: high CPU, for example), but there is no way to safely recover without losing messages. (Note: The frame size issue I mention in that comment has been fixed in one of the latest versions.)

We have also encountered multiple bugs where RabbitMQ will get into a bad state that requires manual recovery. For example, it will suddenly lose all the queue bindings. Or queues will go missing. In several cases the RabbitMQ authors have given me a code snippet to run with the Erlang REPL to fix some internal state table; however, even if you know Erlang, you have to know the deep internals of RabbitMQ in order to think up such a code snippet. There have been a couple of completely unrecoverable incidents where I've simply ended up taking down RabbitMQ, deleting its Mnesia database, and starting up a new cluster again. Fortunately, we use RabbitMQ in a way that allows us to do that.

The bugs have been getting fewer over the years, but they're not altogether gone. It's a shame, since RabbitMQ should have been a model showcase for Erlang's tremendous support for distribution and fault-tolerance. You're lucky if you've not had any issues with it; personally, I would move away from RabbitMQ in a heartbeat, if we had the resources to rewrite a whole bunch of apps. We've started using NATS for some things where persistence isn't needed, and might look at Kafka for some other applications.

[1] https://news.ycombinator.com/item?id=9448258

[2] https://aphyr.com/posts/315-jepsen-rabbitmq

Thanks a lot for elaborating. This is exactly the kind of insight I was looking for.
What do you recommend instead of RabbitMQ on Kubernetes? I use RabbitMQ as a Celery backend. I should probably switch to Redis...