by zbjornson 3525 days ago
(Aside from the issues with reason #1 for not using consul noted below...)

The author says that Atlas is essentially "required" for bootstrapping. We simply use our cloud provider's "list-instances" command (with a filter), so we don't rely on any third party and never contact the internet. This is important to us for security and stability, and it has kept our cluster running with over 1k members. We don't want to go down when Atlas or etcd's discovery service goes down.
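
A rough sketch of the filtering step, for illustration only. The comment doesn't name the cloud provider or the filter used, so the record shape here (`status`, `labels`, `private_ip`) and the `role` label are assumptions, not any provider's actual output format:

```python
from typing import Iterable, List, Mapping


def consul_peers(instances: Iterable[Mapping], role: str = "consul-server") -> List[str]:
    """Pick the private IPs of running instances labelled as Consul servers.

    `instances` stands in for the parsed output of the provider's
    list-instances command; the field names are hypothetical.
    """
    peers = []
    for inst in instances:
        # Skip instances that are stopped, terminating, etc.
        if inst.get("status") != "RUNNING":
            continue
        # Keep only instances carrying the (assumed) role label.
        if inst.get("labels", {}).get("role") != role:
            continue
        peers.append(inst["private_ip"])
    return sorted(peers)
```

Everything happens against the provider's own API inside the private network, so no external discovery service is involved.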

3 comments

At my work, we have figured out a way to fully automate this. We use Packer + Ansible to create the AMI, and Terraform to set up AWS and launch 3 Consul instances based on that AMI.

The last part is a Go program (started from Bash) that does the joining.

It finds all the Consul server nodes (using tags) and then runs `consul join` until it succeeds.

There's a bunch of error checking and timeouts and such to make sure it works correctly.
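
The join-until-success loop could look roughly like this. This is a Python sketch rather than the Go program described, and the attempt count, delay, and the `discover` callback (whatever returns the tagged server addresses) are all assumptions:

```python
import subprocess
import time
from typing import Callable, Iterable, Sequence


def run_consul_join(addresses: Sequence[str]) -> bool:
    """Run `consul join <addr> ...` once; True if the CLI exits with 0."""
    try:
        result = subprocess.run(
            ["consul", "join", *addresses],
            capture_output=True,
            timeout=30,  # don't hang forever on an unreachable peer
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def join_until_success(
    discover: Callable[[], Iterable[str]],
    join: Callable[[Sequence[str]], bool] = run_consul_join,
    attempts: int = 10,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Re-discover peers and retry the join until it works or attempts run out."""
    for attempt in range(attempts):
        # Re-run discovery each time: servers may still be booting.
        addresses = list(discover())
        if addresses and join(addresses):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Re-running discovery on every attempt matters during first boot, when the other servers may not be tagged or reachable yet.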

I'm hoping to post a blog post about this in the next month or so.

To add to this: on boot, I take the output of the instance list and run `consul join` against all instances tagged as consul servers, using cloud-init's "per-boot" script area.

I didn't mean that you need to use Atlas, but I have never seen a nice automation implementation for a scalable consul cluster that doesn't use it. I've also built something around Atlas, mostly using hiera in puppet, but since I used Atlas and the etcd discovery, everything else looks like a workaround.

What provisioning do you use to integrate with your cloud provider's API?

We use consul in our infrastructure, and by default none of our internal hosts have internet access, so we couldn't use Atlas even if we wanted to. We deploy consul through salt, but information about our consul servers is extracted from our internal CMDB, which predates consul.

I'm curious how etcd bootstraps. I don't see a better way to do it than using multicast.

BTW: I absolutely hate that consul (and it looks like etcd has this issue as well) uses HTTP for communication. It's so inefficient to obtain updates about changes this way. ZooKeeper, which everyone loves to hate (and which wasn't even created for service discovery), did this so much better: you have a single long-standing connection on which you subscribe to the updates you want to receive. It has much lower overhead and is simpler to code against.

Another thing that seems to be lost on people who promote their service discovery solution is that you don't always need to be consistent; eventual consistency is perfectly fine. You don't really need raft or paxos to do it.

Since v3 they're using gRPC, which means HTTP/2. Since then, long polling is much more efficient!

It still feels like a hack instead of doing it properly. You're doing a long poll instead of what you should be doing in the first place, which is push.

Again, I have not worked with etcd yet, but in consul, because it's "RESTful", when you monitor multiple services you need to maintain multiple requests.
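
For reference, watching one service with consul's blocking queries looks roughly like the sketch below. The `index` and `wait` query parameters and the `X-Consul-Index` response header are consul's documented long-poll mechanism; the agent address, wait time, and function names are assumptions for illustration:

```python
import json
import urllib.request
from typing import Callable, Iterator, Tuple


def http_fetch(url: str) -> Tuple[int, list]:
    """GET a consul endpoint; return (X-Consul-Index, decoded JSON body)."""
    # The client timeout must exceed the server-side `wait` value.
    with urllib.request.urlopen(url, timeout=90) as resp:
        index = int(resp.headers["X-Consul-Index"])
        return index, json.load(resp)


def watch_service(
    name: str,
    base: str = "http://127.0.0.1:8500",  # assumed local agent address
    fetch: Callable[[str], Tuple[int, list]] = http_fetch,
    wait: str = "60s",
) -> Iterator[list]:
    """Yield the health entries for one service each time its index changes."""
    index = 0
    while True:
        # With index > 0 the request blocks until a change or until `wait` expires.
        url = f"{base}/v1/health/service/{name}?index={index}&wait={wait}"
        new_index, body = fetch(url)
        if new_index < index:
            index = 0  # index went backwards: start over from scratch
            continue
        if new_index != index:
            index = new_index
            yield body
```

Each watched service needs its own `watch_service` loop, and therefore its own outstanding request, which is the overhead being complained about here.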

Edit: Reading more about HTTP/2's frames and pipelining, it looks like it's possible to use it in a similar way to how it's done in ZK[1]. If gRPC allows that, then I suppose it does indeed solve this problem.

[1] Having a single long-standing connection that's not closed after receiving a response. The request frames could be used to place watches, and response frames would send the updates to the client.

etcd uses bi-directional streams for watchers. One TCP connection can maintain multiple streams. No matter what, you need to keep at least one connection. ZooKeeper is not an exception.

That's good to hear. Seems like that would work the same way then.

We use Ansible Tower to bootstrap our consul clusters.