by marcc 2729 days ago
The Operator/CRD pattern is promising for autonomously operating simple use cases of existing software and for operating really complex software that needs very specific, rare knowledge to operate.

Unfortunately, we aren’t there yet for most software. Let’s take Postgres as an example. Today you have to manage your pg database manually (or use a service that manages it for you), but that’s just because the right automation software hasn’t been built yet. Someday, a Kubernetes Operator (or equivalent implementation) will exist that can manage a large Postgres cluster better than a team of DBAs. It’s crazy that there are hundreds (thousands?) of configuration parameters in Postgres, and that these are coupled to the operating system settings in weird and unexpected ways that most people don’t know about. We should be building this knowledge into a K8s Operator and letting that control our postgresql.conf and OS configuration, instead of giving that control to a team of humans who might be able to put in some sane defaults, but will always be chasing optimal Postgres performance as usage patterns change.
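
To make "building this knowledge into an Operator" a bit more concrete, here is a minimal sketch of the reconcile idea: observe the node, compute the settings you would want, and report only what differs. Everything in it is hypothetical and for illustration only. `ObservedState`, `desired_settings` and `reconcile` are made-up names, not part of any real operator, and the two tuning rules are rough rules of thumb (a quarter of RAM for `shared_buffers` is a commonly cited starting point; the three-quarters figure for `effective_cache_size` is an assumption). A real operator would talk to the Kubernetes API and encode far more knowledge than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObservedState:
    """What a hypothetical operator might observe about one Postgres node."""
    total_ram_mb: int
    current_settings: dict


def desired_settings(total_ram_mb: int) -> dict:
    """Encode tuning knowledge as code (illustrative rules of thumb only)."""
    return {
        # Roughly 25% of RAM is a commonly cited starting point.
        "shared_buffers": f"{total_ram_mb // 4}MB",
        # Assumed heuristic: about 75% of RAM for the planner's cache estimate.
        "effective_cache_size": f"{total_ram_mb * 3 // 4}MB",
    }


def reconcile(observed: ObservedState) -> dict:
    """Return only the settings that differ from the desired state."""
    desired = desired_settings(observed.total_ram_mb)
    return {
        key: value
        for key, value in desired.items()
        if observed.current_settings.get(key) != value
    }


if __name__ == "__main__":
    node = ObservedState(
        total_ram_mb=16384,
        current_settings={
            "shared_buffers": "128MB",
            "effective_cache_size": "12288MB",
        },
    )
    # Only shared_buffers drifts from the desired state for this node.
    print(reconcile(node))
```

The point of the sketch is the shape, not the numbers: the operator keeps re-running this comparison as the environment changes, which is exactly the part a team of humans tends to do once and then forget.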

This exists in some places already. For example, Rook is a K8s operator that provisions and manages Ceph in a Kubernetes cluster. As a small startup, if I need this functionality, I don’t want to hire a full-time Ceph admin to figure it out, and I don’t have the expertise to take on operating Ceph myself. Rook productized operating Ceph for us, and “baked in” all of the knowledge needed to manage block and object storage and even set up concurrent, shared file systems. I trust Rook to manage Ceph, and I don’t think that I could do a better job with human intervention.

We have a long way to go. Operators might help get us there, but they are just one pattern we can use. One thing is for sure: we shouldn’t assume that human control over complex software is required to achieve optimal performance.

1 comments

What's your strategy for handling possible filesystem failure/corruption scenarios without a team that understands the underlying technology?

That's a great point. I do have a team that understands the underlying technologies and has successfully troubleshot several production problems with Rook/Ceph, including a recent case of file system corruption. My original post was just trying to say that our engineering team does not maintain deep operational knowledge of the best way to configure, manage, monitor, scale, etc. (operate) Ceph in production. We rely on the Rook operator for this.

Troubleshooting acute outages caused by hardware or software failures requires a different skill than properly configuring the system to scale and to minimize the chances of corruption or outages. Rook solves the latter, but we do understand the architecture and what Rook (and Ceph) are doing. We've just removed the expert-level, craftsman, specialty knowledge required to operate Ceph because we decided, after a thorough evaluation, that the software in this case is the most capable solution.

I find this unusual because the knowledge required to troubleshoot a complex piece of software is usually much deeper than the knowledge required to set it up in the first place. In other words, how can you troubleshoot it if you don’t know how it’s built?

It’s a bit like debugging software you didn’t write.