by aguacaterojo 865 days ago
Very similar story for my team, incl. the two cert-expiry cluster disasters early on that required a rebuild. We migrated from Kubespray to kOps (with almost no deviations from a default install) and it's been quite smooth for 4 or 5 years now.
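
For anyone who wants to avoid the cert-expiry surprise, here is a minimal sketch of an expiry check. It is not what we run, just an illustration: it assumes you already have the certificate's `notAfter` string (for example from `openssl x509 -enddate -noout -in cert.pem`), and the function names and the 30-day threshold are made up for the example.

```python
import ssl
import time
from typing import Optional

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a cert expires.

    `not_after` uses the OpenSSL text format, e.g. "Feb 19 00:00:00 2024 GMT".
    `now` is a Unix timestamp; it defaults to the current time.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires_at - now) / SECONDS_PER_DAY


def needs_rotation(not_after: str, warn_days: float = 30,
                   now: Optional[float] = None) -> bool:
    """True when the cert expires within `warn_days` (hypothetical threshold)."""
    return days_until_expiry(not_after, now) <= warn_days
```

Run something like this from a cron job or CI step against each cluster cert and alert when `needs_rotation` returns True, so expiry shows up as a warning instead of an outage.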

I traded ELK for ClickHouse, and we use Fluent Bit to relay logs, mostly created by our homegrown OpenTelemetry-like lib. We still use Helm, Quay & Drone.

Software architecture is mostly stateless replicas of ~12 mini services plus a primary monolith. DBs etc. sit off-cluster. A full cluster rebuild and switchover takes about 60-90 min; we do it 1-2x a year, and 3 developers in a team of 5 can do it (thanks to good documentation, automation and keeping our use simple).

We have a single cloud dev environment; local dev is just running the parts of the system you need to affect.

Some tradeoffs and, yes, burned time to get there, but it's great.