A year ago I left my job at a 500-employee SaaS business, where I worked on the team that maintains the devops infrastructure, to found a startup. For me the biggest pain point is going from nothing to a sufficiently flexible devops setup. There are a lot of great tools out there, but making them play well together is an exercise for the reader. There are also a lot of preference-based choices you need to make about how you want your setup to look, and what you choose will affect which tools make sense to you.

Do you go monorepo or polyrepo? If you go monorepo, how do you decide what to build and deploy on each merge? If you go polyrepo, how do you keep any code you want to share in sync? Once a build is complete, how do you trigger a deployment? How does your CI system integrate with your deployment system, or is the answer "with some shell scripts you have to write"?

> How do you deploy resources?

We have a monorepo setup with Bazel. I wrote some fairly primitive scripts to scan git changes to decide what to build. We use Buildkite for CI, which triggers rollouts to Kubernetes with ArgoCD. I had to do a non-trivial amount of work to tie all this together, but it's been fairly robust and has only needed a minimal amount of care and feeding.

> How do you define architecture?

Kubernetes charts for our services are in git, but there's some amount of extra stuff deployed (the ingress controller, for example) that is documented in a text file.

> How do you manage your environments

We don't need to deploy environments super often, so we just do it manually and update the documentation in the process if any variations are needed.

> observability

Datadog and Sumo Logic.

Overall our setup doesn't come close to the one I worked on at my last employer, but I have to balance time spent on devops infra with time spent on the product, and that setup took ~5 full-time engineers to maintain.
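For anyone wondering what "scan git changes to decide what to build" can look like in a monorepo, here is a minimal sketch. It is not the scripts described above. The directory names, the `TARGETS_BY_DIR` mapping, and the `origin/main` base ref are all made-up assumptions for illustration; only the `git diff --name-only` call is a real command.

```python
import subprocess
from pathlib import PurePosixPath

# Hypothetical mapping from monorepo directories to Bazel target patterns.
# These paths and labels are illustrative assumptions, not a real layout.
TARGETS_BY_DIR = {
    "services/api": "//services/api/...",
    "services/worker": "//services/worker/...",
    "libs": "//...",  # assumption: shared code changes rebuild everything
}


def changed_files(base="origin/main", head="HEAD"):
    """Return paths changed on `head` since it diverged from `base`."""
    result = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...{head}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return [line for line in result.stdout.splitlines() if line]


def targets_for(paths, targets_by_dir=TARGETS_BY_DIR):
    """Map changed file paths to the target patterns that should be rebuilt."""
    targets = set()
    for path in paths:
        parents = PurePosixPath(path).parents
        for directory, target in targets_by_dir.items():
            # A file belongs to a directory if that directory is one of its ancestors.
            if PurePosixPath(directory) in parents:
                targets.add(target)
    # "//..." already covers every other pattern, so collapse to it.
    if "//..." in targets:
        return ["//..."]
    return sorted(targets)
```

In CI, the result of `targets_for(changed_files())` could be passed as arguments to `bazel build`. Files that match no known directory (a top-level README, say) produce no targets, so a docs-only merge would build nothing under this scheme.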
Out of curiosity, why just the "readmeware" for those components? I can't think of a single thing that requires clickops in a modern k8s setup. In the beginning we used to bring up the full stack from nothing based on a single CFN template: roles, load balancer, auto-scaling group, control plane, CSI driver (this was back when EKS was a raging tire fire), and then lay the actual business apps on top of it. The whole process took about 8 minutes from go.
If nothing else, one will want to be cautious about readmeware components in disaster recovery situations. If no one has run those steps in 6 months and then there's some kind of "all hands on deck" incident, the stress will likely make that institutional knowledge leak out of their ears.