| Original creator and maintainer of Reclaim the Stack here.

> you also removed monitoring of the platform

No, we did not:
Monitoring: https://reclaim-the-stack.com/docs/platform-components/monit...
Log aggregation: https://reclaim-the-stack.com/docs/platform-components/log-a...

Observability is on the whole better than what we had at Heroku, since we now have direct access to realtime resource consumption of all infrastructure parts. We could also retain logs indefinitely, which would have been prohibitively expensive using Heroku logging addons (though we cap retention at 12 months for GDPR reasons).

> Who/What is going to be doing that on this new platform and how much does that cost?

My colleague and I, who created the tool together, manage infrastructure and OS upgrades, look into issues, etc. So far we've been in production for 1.5 years on this platform. On average we have spent perhaps 3 days per month doing platform related work (mostly software upgrades). The rest we spend on full stack application development.

The hypothesis for migrating to Kubernetes was that the available database operators would be robust enough to automate all common high availability / backup / disaster recovery issues. This has proven to be true, apart from the Redis operator, which has been our only pain point from a software point of view so far. We are currently rolling out a replacement approach using our own Kubernetes templates instead of relying on an operator at all for Redis.

> Now you need to maintain k8s, postgresql, elasticsearch, redis, secret managements, OSs, storage... These are complex systems that require people understanding how they internally work

Thanks to Talos Linux (https://www.talos.dev/), maintaining K8s has been a non issue. Running databases via operators has been a non issue, apart from Redis. Secret management via sealed secrets + CLI tooling has been a non issue (https://reclaim-the-stack.com/docs/platform-components/secre...). OS management with Talos Linux has been a learning curve but not too bad.
We built talos-manager to make bootstrapping new nodes into our cluster straightforward (https://reclaim-the-stack.com/docs/talos-manager/introductio...). The only remaining OS related maintenance is OS upgrades, which require rebooting servers, but that's about it.

For storage we chose to go with simple local storage instead of complicated network based storage (https://reclaim-the-stack.com/docs/platform-components/persi...). Our servers come with datacenter grade NVMe drives. All our databases are replicated across multiple servers so we can gracefully deal with failures, should they occur.

> Who is going to upgrade kubernetes when they release a new version that has breaking changes?

Upgrading Kubernetes in general can be done with zero downtime and is handled by a single talosctl CLI command. Breaking changes in K8s imply changes to existing resource manifest schemas and are detected by tooling before upgrades occur. Given how stable Kubernetes resource schemas are and how averse the community is to pushing breaking changes, I don't expect this to cause major issues going forward. But of course software upgrades will always require due diligence and can sometimes be time consuming; K8s is no exception.

> What happens when ElasticSearch decides to splitbrain and your search stops working?

ElasticSearch, since major version 7, should not enter split brain if correctly deployed across 3 or more nodes. That said, in case of a complete disaster we could either rebuild our index from the source of truth (Postgres) or do disaster recovery from off site backups. It's not like using ElasticCloud protects against these things in any meaningfully different way. However, the feedback loop of contacting support would be slower.

> When the DB goes down or you need to set up replication?

Operators handle failovers. If we were to lose all replicas in a major disaster event we would have to recover from off site backups. The same rules would apply for managed databases.

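
As a rough illustration of the split-brain point above: the usual reasoning is that a master can only be elected by a strict majority of master-eligible nodes, so two sides of a network partition can never both elect one. The sketch below is a simplified model of that majority rule, not ElasticSearch's actual implementation; all function names are hypothetical.

```python
def quorum(master_eligible: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return master_eligible // 2 + 1


def can_elect_master(partition_size: int, master_eligible: int) -> bool:
    """In this model, a partition may elect a master only with a majority."""
    return partition_size >= quorum(master_eligible)


def split_brain_possible(master_eligible: int) -> bool:
    """Check every two-way partition for two simultaneous majorities."""
    return any(
        can_elect_master(left, master_eligible)
        and can_elect_master(master_eligible - left, master_eligible)
        for left in range(master_eligible + 1)
    )


def tolerated_failures(master_eligible: int) -> int:
    """How many nodes can be lost while a majority still remains."""
    return master_eligible - quorum(master_eligible)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(n, quorum(n), tolerated_failures(n), split_brain_possible(n))
```

Under this model no cluster size allows two majorities at once, and 3 nodes is the smallest size that can lose a node and still elect a master, which is one way to read the "3 or more nodes" condition.
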
> What is monitoring replication lag?

For Postgres, which is our only critical data source, replication lag monitoring and alerting is built into the operator. It should be straightforward to add this for Redis and ElasticSearch as well.

> Or even simply things like disks being close to full?

Disk space monitoring and alerting is built into our monitoring stack.

At the end of the day I can only describe to you the facts of our experience. We have reduced costs by enough to cover hiring about 4 full time DevOps people so far. But we have hired 0 new engineers and are managing fine with just a few days of additional platform maintenance per month.

That said, we're not trying to make the point that EVERYONE should Reclaim the Stack. We documented our thoughts about it here: https://reclaim-the-stack.com/docs/kubernetes-platform/intro... |
- Front page links to docs and Discord.
- First page of docs only has a link to Discord.
- Installation references a "get started" repo that is... somehow also the main repo, not just "get started"?