
I think the big problem is that it tries to do too much. As SREs we used to have many tools, but now teams are really limited. We handed the keys to the engineers, which I think was a good intention overall. But we didn’t set them up with sensible defaults, which left them open to making really bad decisions. We made it easy to increase the diversity in the fleet, and we removed observability. I think things are more opaque and more complicated, and I have fewer tools to deal with it.

I miss having lots of tools to reach for. Lots of different solutions, depending on where my company was and what they were trying to do.

I don’t think one T-shirt size fits all. But here are some specific things that annoy me.

Puppet had a richer change management language than Docker. When I lost Puppet, we had to revert to shitty bash scripts and nondeterminism from the CI/CD builds. The worst software in your org is always the build scripts. But now that is the whole host state! So SREs are held captive by nonsense in the CI/CD box. If you were using Jenkins 1.x, the job config wasn’t even checked in! With Puppet I could use git to tell me what config changed, for tracked state anyway. Docker is nice in that the images are consistent, and inconsistency is a huge pain point with bad Puppet code. So it’s a mixed bag.

The clouds and network infrastructure carry a lot of old assumptions about hosts/IPs/ports. This comes up a lot in network security, service discovery, and cache infrastructure. Dealing with this in the k8s world is so much harder, and the cost and performance are so much worse. It’s really shocking to me how much people pay because they are using these software-based networks.

The hypervisors and native cloud solutions were much better at noisy-neighbor protection, and a better abstraction for carving up workloads. When I worked at AWS I got to see the huge lengths the EBS and EC2 teams went to in order to provide consistent performance. VMware has also done a ton of work on QoS. The OS kernels are just a lot less mature on this. Running in the cloud inside a single VM removes most of the value of this work.

In the early 2010s, lots of teams were provisioning EC2 instances, and their costs were easy to see on the bill in dollars and cents. At my last company, we were describing workloads as replicas/GBs/CPUs/clusters on a huge shared cluster: thousands of hosts, a dozen data centers.

This added layer of obfuscation hides the true cost of a workload. I watched a presentation from a large, well-known software service company say that their k8s migration increased their cloud spend because teams were no longer accountable for spend. At my company, I saw the same thing. Engineers were given the keys on provisioning but were not in the loop for cost cutting. That fell to the SREs, who were blamed for exploding costs. The engineers are really just not prepared to handle this kind of work. They have no understanding of the implications in terms of cost and performance. We didn’t train them on these things. But we took the keys away from the SREs and handed them to the engineers.
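To make the point concrete, here is a rough sketch of turning replicas/CPUs/GBs back into dollars per team. The unit prices, the `Workload` fields, and the function names are all made up for illustration; real numbers would have to come from your own cloud bill.

```python
from dataclasses import dataclass

# Hypothetical unit prices, for illustration only. These are NOT real cloud rates.
CPU_HOUR_USD = 0.04
GB_HOUR_USD = 0.005
HOURS_PER_MONTH = 730


@dataclass
class Workload:
    team: str
    replicas: int
    cpus_per_replica: float
    gbs_per_replica: float


def monthly_cost(w: Workload) -> float:
    """Estimated monthly dollars for one workload under the assumed rates."""
    per_replica_hour = (
        w.cpus_per_replica * CPU_HOUR_USD + w.gbs_per_replica * GB_HOUR_USD
    )
    return w.replicas * per_replica_hour * HOURS_PER_MONTH


def cost_by_team(workloads):
    """Sum the estimated monthly cost of every workload, grouped by team."""
    totals = {}
    for w in workloads:
        totals[w.team] = totals.get(w.team, 0.0) + monthly_cost(w)
    return totals
```

Even a crude estimate like this, shown next to the replica count at provisioning time, puts the dollars back in front of the people turning the knobs.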

The debugging story is particularly weak. Once we shipped on Docker and k8s, we lost ssh access to production. Ten years into the Docker experiment, we now have a generation of senior engineers who don’t know how to debug. I’ve spent dozens of hours on conference calls while the engineers fumbled around. Most of these issues could have been diagnosed with netstat/lsof/perl -pe/ping/traceroute. If the issue didn’t appear in New Relic, then they were totally helpless. The loss of the bash one-liner is really detrimental to engineers’ progress.
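As one example of what gets lost: a stripped-down container often has no netstat at all, but on Linux the kernel still exposes the same data in /proc/net/tcp. A minimal sketch of reading it, assuming IPv4 only and a little-endian machine (x86/ARM); the function names are mine, not from any tool:

```python
import socket
import struct

LISTEN = "0A"  # hex state code the kernel uses for TCP LISTEN in /proc/net/tcp


def parse_proc_net_tcp(text):
    """Return (ip, port) pairs for IPv4 sockets in the LISTEN state."""
    listeners = []
    for line in text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 4 or fields[3] != LISTEN:
            continue
        hex_ip, hex_port = fields[1].split(":")
        # The address is printed as one 32-bit hex number in host byte order;
        # packing it little-endian is an assumption that holds on x86/ARM.
        ip = socket.inet_ntoa(struct.pack("<I", int(hex_ip, 16)))
        listeners.append((ip, int(hex_port, 16)))
    return listeners


def listening_sockets(path="/proc/net/tcp"):
    """Read the live table from the kernel (Linux only)."""
    with open(path) as f:
        return parse_proc_net_tcp(f.read())
```

It is nowhere near a replacement for netstat or lsof, but it answers "is anything actually listening on that port?" from inside a container that ships nothing but a Python interpreter.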

There is too much diversity in the Docker base images and too many of them stuck. The tool encourages every engineer to pick a different one. To solve this my org promised to converge on Alpine. But if you use a Docker distribution, now you are shipping all of user mode to every process. I was on the hook for fixing a libc exploit for our fleet. I had everyone on a common base image, so fixing all 80 of my host classes took me a few days. But my coworkers in other orgs who had hundreds of different Docker images were working on it a year later. Answering the question “which libc am I on?” became very difficult.
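A rough sketch of how you might start answering the libc question per image: read the image’s /etc/os-release and map the distro ID to a libc family. The mapping table and function names below are my own assumptions and only cover a few common distros. An image built from scratch may have no /etc/os-release at all, and a statically linked binary ignores the image’s libc entirely.

```python
# Assumed mapping from os-release ID to libc family; extend it for your fleet.
LIBC_BY_DISTRO = {
    "alpine": "musl",
    "debian": "glibc",
    "ubuntu": "glibc",
    "centos": "glibc",
    "rhel": "glibc",
    "fedora": "glibc",
    "amzn": "glibc",
}


def parse_os_release(text):
    """Parse KEY=value lines from the contents of /etc/os-release."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be wrapped in single or double quotes.
        fields[key.strip()] = value.strip().strip("\"'")
    return fields


def libc_family(os_release_text):
    """Best-guess libc family for an image, or 'unknown' if the ID is not mapped."""
    distro = parse_os_release(os_release_text).get("ID", "").lower()
    return LIBC_BY_DISTRO.get(distro, "unknown")
```

Run that across hundreds of images and at least you get a list of which ones to worry about first. The "unknown" bucket is where the year of work goes.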

Terraform has a better provisioning/migration story. Use that to design your network and perform migrations. Use the cloud-native networking constructs. Use them for security boundaries. Having workloads move seamlessly between these “anything can be on me” hosts makes security a real nightmare.

I left being an SRE behind when I saw management get convinced Docker/k8s was a cancer treatment, a dessert topping, and a floor wax. It’s been five years and I think I made the right call.