by asguy 1293 days ago
> we’re good at consul

Thank god someone is. I’ve lost more of my life to consul partition failures than any other part of the nutech stack.


Yup, our experience with the brittleness of consul and nomad soured me on the HashiCorp stack even though I was very excited about it. We went from buying enterprise licenses to throwing it all out without ever pushing to production in the span of a year.

This stuff, for all the hype about raft and things, is brittle and requires a lot of special attention.

Which is why I’d gladly have it be fly.io’s problem.

For what it's worth, if I was deploying a bounded set of applications or services in a small number of geographies and couldn't use Fly.io for some reason, I wouldn't hesitate to use Nomad. Nomad is pretty great; it's like Flask to K8s's Django.
I did love the simplicity of nomad.

And in general, nomad worked pretty well for us but our consul cluster kept mysteriously failing. I think that caused our nomad cluster to fail because it was backed by Consul.

The one complaint I did have about nomad (same as consul) was that the recovery process was manual: you had to generate a peers.json by hand.
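For context, a minimal sketch of what that file looks like, assuming the Raft protocol version 3 layout described in the outage recovery docs (a JSON array with one `id`/`address`/`non_voter` entry per surviving server). The node IDs and IPs below are made up, the helper name is hypothetical, and 8300 is Consul's default server RPC port (Nomad's is 4647). Check the docs for your version before relying on it.

```python
import json

def build_peers_json(servers, port=8300):
    """Render a peers.json body for the surviving servers.

    `servers` is a list of (node_id, ip) pairs. The id/address/non_voter
    layout is assumed to be the Raft protocol v3 format; older protocol
    versions use a plain list of "ip:port" strings instead.
    """
    peers = [
        {"id": node_id, "address": f"{ip}:{port}", "non_voter": False}
        for node_id, ip in servers
    ]
    return json.dumps(peers, indent=2)

# Hypothetical survivors: made-up node IDs and private IPs.
survivors = [
    ("11111111-1111-1111-1111-111111111111", "10.0.1.10"),
    ("22222222-2222-2222-2222-222222222222", "10.0.1.11"),
]
peers_json = build_peers_json(survivors)
print(peers_json)
```

The output would still have to be placed on every remaining server while the agents are stopped, which is the manual part being complained about.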

I was shocked when I saw that. Truly one of the "finding out how the sausage is made" moments, even though I've managed Linux servers for two decades. I always assumed it would use zeroconf/Bonjour/multicast DNS (remember cloud auto-join?) or something similarly elegant to auto-discover other nodes in the network and just reconnect and rebuild a cluster. I mean, what's the point of all this stuff if it can't be used to recover a cluster and just Do The Right Thing™? The shiny new experience is stellar (like sales, or setting up a new cluster), but the flip side (when things go wrong) is a mess. That's why we eventually said "nope!" to all the custom stuff and went with boring, plain vanilla ECS, which is itself too much now that we've started using fly.

Don't ever want to even think about having to hand-write a peers.json file to recover a cluster, boot things up, and pray to the ancient gods that it works.

We don't have time for that nonsense. Please, take my money, Fly/Render/everyone else. Your costs are a margin of error compared to what I had to pay a devops person to build our own stack. (I'm not even exaggerating. It was six figures. DevOps people are worth every penny but they cost many, many pennies.) Ultimately, we never used the infra.

I want to focus on building solutions for my customers and not fiddling with weird server stuff.

How long ago did you use nomad? Nomad integrates with consul but isn't backed by it. It's also pretty trivial to run consul in quite a shitty network environment by bumping up some of the settings (they should probably change their 'production' suggestions).
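One setting that plausibly fits that description is Consul's `performance.raft_multiplier`, which scales the Raft heartbeat and election timeouts: the production guidance is 1, the default is equivalent to 5, and the maximum is 10. That this is the setting meant above is an assumption, and the helper name and file name below are hypothetical. A sketch that builds such a config fragment:

```python
import json

def consul_perf_config(raft_multiplier=5):
    """Build a Consul agent config fragment with looser Raft timing.

    Higher multipliers mean longer heartbeat/election timeouts, which
    trades slower failure detection for fewer spurious leader elections.
    """
    if not 1 <= raft_multiplier <= 10:
        raise ValueError("raft_multiplier must be between 1 and 10")
    return {"performance": {"raft_multiplier": raft_multiplier}}

# Hypothetical choice: go above the default of 5 to tolerate a flaky network.
config_text = json.dumps(consul_perf_config(8), indent=2)
print(config_text)  # could be saved as e.g. performance.json in the config dir
```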
We decided to retire the whole infra about 8 months ago. A lot of our consul complications happened about 18 months ago.

The consul clusters would keep failing in QA (they were running on t2.nanos – but that should be plenty of bandwidth for raft not to blow up every couple weeks, same happened with t2.micros too).

Before we pulled the plug we had started seeing something about EC2 health checks failing and the autoscaling groups yanking servers out and replacing them with new servers. But this is exactly the kind of case where consul should've just added the new machine right in. Instead, the 3-node cluster (now 2 nodes) would just sit there saying "hey I can't find a leader... aaaah. I can't find a leader". Well, to paraphrase Mike Myers on SNL, TALK AMONGST YUHSELVES and figure it out, there are two of you remaining.
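For reference, the arithmetic behind that complaint: Raft needs a majority of the configured voters, so quorum is floor(n/2) + 1. Two live servers out of three is still a quorum, which is why the stall is surprising. It stops being one if the configured peer set is larger than the live set suggests, for example if replacement servers were registered while dead ones were never removed. That scenario is a guess, not a diagnosis of this cluster. A quick sketch with hypothetical helper names:

```python
def quorum(voters):
    """Smallest majority of a Raft voter set."""
    return voters // 2 + 1

def can_elect_leader(voters, alive):
    """True if the live servers still form a majority of configured voters."""
    return alive >= quorum(voters)

# (configured voters, servers still alive)
for voters, alive in [(3, 3), (3, 2), (3, 1), (4, 2), (5, 3)]:
    print(f"voters={voters} alive={alive} quorum={quorum(voters)} "
          f"leader possible: {can_elect_leader(voters, alive)}")
```

The (4, 2) row is the stale-peer case: two live servers, but four configured voters, so a majority of three is out of reach.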

They've had their fair share of pains with it, but they still seem pretty happy with it https://fly.io/blog/a-foolish-consistency/
Happy is a word. There are lots of words. I like words! You could be creative about what word could take the place of "happy" in that sentence, and probably still be correct. Get weird with it! Maybe "sanguine" would work. "Engaged".

The reality is: we've got a fair bit of experience with Consul at this point, we respect the hell out of it for the problems it was designed to solve, and we're unlikely to stretch it any further than we've already stretched it. Distributed lock service for Postgres clusters? Sure. Source of truth for all our app state? We've built our own thingy ("corrosion", a Rust distributed state system) to phase Consul out with. I'll get Jerome to say things about it.

> Sure. Source of truth for all our app state? We've built our own thingy ("corrosion", a Rust distributed state system) to phase Consul out with. I'll get Jerome to say things about it.

Looking forward to the blog post!

Sorry, yes. 'Invested' was more along the lines of what I was thinking :)
What else would you include in the nutech stack? HashiCorp products in general?