by dbackeus 641 days ago
Original creator and maintainer of Reclaim the Stack here.

> you also removed monitoring of the platform

No, we did not. Monitoring: https://reclaim-the-stack.com/docs/platform-components/monit...

Log aggregation: https://reclaim-the-stack.com/docs/platform-components/log-a...

Observability is on the whole better than what we had at Heroku since we now have direct access to realtime resource consumption of all infrastructure parts. We also have infinite log retention which would have been prohibitively expensive using Heroku logging addons (though we cap retention at 12 months for GDPR reasons).

> Who/What is going to be doing that on this new platform and how much does that cost?

Me and my colleague who created the tool together manage infrastructure / OS upgrades and look into issues etc. So far we've been in production 1.5 years on this platform. On average we spend perhaps 3 days per month doing platform related work (mostly software upgrades). The rest we spend on full stack application development.

The hypothesis for migrating to Kubernetes was that the available database operators would be robust enough to automate all common high availability / backup / disaster recovery issues. This has proven to be true, apart from the Redis operator which has been our only pain point from a software point of view so far. We are currently rolling out a replacement approach using our own Kubernetes templates instead of relying on an operator at all for Redis.

> Now you need to maintain k8s, postgresql, elasticsearch, redis, secret managements, OSs, storage... These are complex systems that require people understanding how they internally work

Thanks to Talos Linux (https://www.talos.dev/), maintaining K8s has been a non issue.

Running databases via operators has been a non issue, apart from Redis.

Secret management via sealed secrets + CLI tooling has been a non issue (https://reclaim-the-stack.com/docs/platform-components/secre...)

OS management with Talos Linux has been a learning curve but not too bad. We built talos-manager to make bootstrapping new nodes into our cluster straightforward (https://reclaim-the-stack.com/docs/talos-manager/introductio...). The only remaining OS related maintenance is OS upgrades, which require rebooting servers, but that's about it.

For storage we chose to go with simple local storage instead of complicated network based storage (https://reclaim-the-stack.com/docs/platform-components/persi...). Our servers come with datacenter grade NVMe drives. All our databases are replicated across multiple servers so we can gracefully deal with failures, should they occur.

> Who is going to upgrade kubernetes when they release a new version that has breaking changes?

Upgrading Kubernetes in general can be done with 0 downtime and is handled by a single talosctl CLI command. Breaking changes in K8s imply changes to existing resource manifest schemas and are detected by tooling before upgrades occur. Given how stable Kubernetes resource schemas are and how averse the community is to pushing breaking changes, I don't expect this to cause major issues going forward. But of course software upgrades will always require due diligence and can sometimes be time consuming; K8s is no exception.

> What happens when ElasticSearch decides to splitbrain and your search stops working?

ElasticSearch, since major version 7, should not enter split brain if correctly deployed across 3 or more nodes. That said, in case of a complete disaster we could either rebuild our index from source of truth (Postgres) or do disaster recovery from off site backups.
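To illustrate why the "3 or more nodes" part matters, here is a minimal sketch of plain majority voting. It is not ElasticSearch's actual coordination code (the function names are made up for this example); it only shows that with a strict-majority rule at most one side of a network partition can keep electing a master.

```python
def quorum(voting_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return voting_nodes // 2 + 1


def sides_with_quorum(partition_sizes: list[int]) -> list[bool]:
    """For each side of a network partition, can it still reach a majority?"""
    total = sum(partition_sizes)
    needed = quorum(total)
    return [size >= needed for size in partition_sizes]


# 3 master-eligible nodes split 2 / 1: only the larger side keeps a majority.
print(sides_with_quorum([2, 1]))  # [True, False]

# 2 nodes split 1 / 1: neither side has a majority, so the cluster stalls
# instead of diverging into two independent masters.
print(sides_with_quorum([1, 1]))  # [False, False]
```

With only 2 master-eligible nodes any partition halts the cluster, which is why 3 is the practical minimum for staying available through a single node failure.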

It's not like using ElasticCloud protects against these things in any meaningfully different way. However, the feedback loop of contacting support would be slower.

> When the DB goes down or you need to set up replication?

Operators handle failovers. If we would lose all replicas in a major disaster event we would have to recover from off site backups. Same rules would apply for managed databases.

> What is monitoring replication lag?

For Postgres, which is our only critical data source, replication lag monitoring + alerting is built into the operator.

It should be straightforward to add this for Redis and ElasticSearch as well.

> Or even simply things like disks being close to full?

Disk space monitoring and alerting is built into our monitoring stack.

At the end of the day I can only describe to you the facts of our experience. We have reduced costs enough to cover hiring about 4 full-time DevOps people so far. But we have hired 0 new engineers and are managing fine with just a few days of additional platform maintenance per month.

That said, we're not trying to make the point that EVERYONE should Reclaim the Stack. We documented our thoughts about it here: https://reclaim-the-stack.com/docs/kubernetes-platform/intro...


Since you're the original creator, can you open the site of your product, and find the link to your project that you open sourced?

- Front page links to docs and Discord.

- First page of docs only has a link to discord.

- Installation references a "get started" repo that is... somehow also the main repo, not just "get started"?

The get-started repo is the starting point for installing the platform. Since the platform is gitops based, you'll fork this repo as described in: https://reclaim-the-stack.com/docs/kubernetes-platform/insta...

If this is confusing, maybe it would make sense to rename the repo to "platform" or something.

The other main component is k (https://github.com/reclaim-the-stack/k), the CLI for interacting with the platform.

We have also open sourced a tool for deploying Talos Linux on Hetzner called talos-manager: https://github.com/reclaim-the-stack/talos-manager (but you can use any Kubernetes, managed or self-hosted, so this is use-case specific)

You talk a lot about the platform on the page, in the overview page, and there are no links to the platform.

There's not even an overview of what the platform is, how everything is tied together, and where to look at it except bombastic claims, disparate descriptions of its constituent components (with barely any links to how they are used in the "platform" itself), and a link to a repo called "get-started"

Assuming average salary of 140k/year, you are dedicating 2 resources 3 times a month and this is already costing you ~38k/year on salaries alone and that's assuming your engineers have somehow mastered _both_ devops and software (very unlikely) and that they won't screw anything up. I'm not even counting the time it took you to migrate away..

This also assumes your infra doesn't grow and requires more maintenance or you have to deal with other issues.

Focusing on building features and generating revenue is much more valuable than wasting precious engineering time maintaining stacks.

This is hardly a "win" in my book.

Right, because your outsourced cloud provider takes absolutely zero time of any application developers. Any issue with AWS and GCP is just one magic support ticket away and their costs already include top priority support.

Right? Right?!

Heroku isn’t really analogous to AWS and GCP. Heroku actually is zero effort for the developers.

> Heroku actually is zero effort for the developers.

This is just blatantly untrue.

I was an application developer at a place using Heroku for over four years, and I guarantee you we exceeded the aforementioned 2-devs-3-days-per-month in man hours in my time there due to Heroku:

- Matching up local env to Heroku images, and figuring out what it actually meant when we had to move off deprecated versions

- Peering at Heroku charts because of the lack of real machine observability, and eventually using Node to capture OS metrics and push them into our existing ELK stack because there was just no alternative

- Fighting PR apps to get the right set of env vars to test particular features, and maintaining a set of query-string overrides because there was no way to automate it into the PR deploy

I'm probably forgetting more things, but the idea that Heroku is zero effort for developers is laughable to me. I hate docker personally but it's still way less work than Heroku was to maintain, even if you go all the way down the rabbit hole of optimizing away build times etc.

> Assuming average salary of 140k/year

Is that what developers at your company cost?

Just curious. In Sweden the average devops salary is around 60k.

> you are dedicating 2 resources 3 times a month and this is already costing you ~38k/year on salaries

Ok. So we're currently saving more than 400k/year on our migration. That would be worth 38k/year in salaries to us. But note that our actual salary costs are significantly lower.
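As a rough sanity check of those numbers, here is a sketch of the arithmetic. It assumes 260 working days per year and reads "2 resources 3 times a month" as 2 engineers spending 3 days each per month; both are assumptions made for this example, not figures stated in the thread, and the currency of the 60k figure is not specified.

```python
WORKING_DAYS_PER_YEAR = 260  # assumption: 52 weeks x 5 days, ignoring holidays


def yearly_maintenance_cost(salary_per_year: float,
                            engineers: int = 2,
                            days_per_month: float = 3) -> float:
    """Salary cost of the engineer-days spent on platform work per year."""
    day_rate = salary_per_year / WORKING_DAYS_PER_YEAR
    engineer_days = engineers * days_per_month * 12
    return day_rate * engineer_days


high_estimate = yearly_maintenance_cost(140_000)  # salary assumed in the parent comment
low_estimate = yearly_maintenance_cost(60_000)    # Swedish average quoted above
savings = 400_000                                 # "more than 400k/year"

print(round(high_estimate))               # roughly 38.8k, matching the ~38k claim
print(round(low_estimate))                # roughly 16.6k
print(round(savings / high_estimate, 1))  # savings as a multiple of the cost
```

Even under the higher salary assumption, the stated savings come out at about ten times the estimated maintenance cost.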

> that's assuming your engineers have somehow mastered _both_ devops and software (very unlikely)

Both me and my colleague are proficient at operations as well as programming. I personally believe the skillsets are complementary and that web developers need to get into operations / scaling to fully understand their craft. But I've deployed web sites since the 90s. Maybe I'm of a different breed.

We achieved 4 nines of uptime in our first year on this platform which is more than we ever achieved using Heroku + other managed cloud services. We won't reach 4 nines in our second year due to a network failure on Hetzner, but so far we have not had downtime due to software issues.
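For a sense of scale, "4 nines" means 99.99% availability. A small sketch of what that allows per year, assuming a 365-day year (the helper name is made up for this example):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability of `nines` nines."""
    unavailability = 10 ** -nines  # e.g. 4 nines -> 0.0001
    return MINUTES_PER_YEAR * unavailability


for n in (3, 4, 5):
    print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes/year")
```

Four nines works out to a budget of a little under an hour of downtime per year, so a single extended network outage is enough to miss it.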

> This also assumes your infra doesn't grow and requires more maintenance

In general the more our infra grows the more we save (and we're still in the process of cutting additional costs as we slowly migrate more stuff over). Since our stack is automated we don't see any significant overhead in maintenance time for adding additional servers.

Potentially some crazy new software could come along that would turn out to be hard to deploy. But if it would be cheaper to use a managed option for that crazy software we could still just use a managed service. It's not like we're making it impossible to use external services by self-hosting.

Note that I wouldn't recommend Reclaim the Stack to early stage startups with minor hosting requirements. As mentioned on our site I think it becomes interesting around $5,000/month in spending (but this will of course vary depending on a number of factors).

> Focusing on building features and generating revenue is much more valuable than wasting precious engineering time maintaining stacks.

That's a fair take. But the trade-offs will look different for every company.

What was amazing for us was that the developer experience of our platform ended up being significantly better than Heroku's. So we are now shipping faster. Reducing costs by an order of magnitude also allowed us to take on data intensive additions to our product which we would have never considered in the previous deployment paradigm since costs would have been prohibitively high.

> Just curious. In Sweden the average devops salary is around 60k.

Well there's salary, and total employee cost. Not sure how it works in Sweden, but here in Belgium it's a good rule of thumb that an employer pays roughly 2.5 times what an employee nets at the end after taxes etc. So a net wage of €3300/month, or about €40k/year, ends up costing the employer about €100k.
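Spelled out as a quick sketch, with the 2.5 multiplier taken purely as the rule of thumb above rather than an official figure:

```python
def employer_cost_per_year(net_per_month: float, multiplier: float = 2.5) -> float:
    """Yearly employer cost under a flat net-wage multiplier (rule of thumb)."""
    net_per_year = net_per_month * 12
    return net_per_year * multiplier


print(employer_cost_per_year(3300))  # 99000.0, i.e. about EUR 100k
```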

I'm a freelance devops/sre/platform engineer, and all I can tell you is that even for long-term projects, my yearly invoice is considerably higher than that.

This is more FUD. Employer cost is nowhere near 2.5x employee wages.

Hey there, this is a comprehensive and informative reply!

I had two questions just to learn more.

* What has been your experience with using local NVMes with K8s? It feels like K8s has some assumptions around volume persistence, so I'm curious if these impacted you at all in production.

* How does 'Reclaim the Stack' compare to Kamal? Was migrating off of Heroku your primary motivation for building 'Reclaim the Stack'?

Again, asking just to understand. For context, I'm one of the founders at Ubicloud. We're looking to build a managed K8s service next and evaluating trade-offs related to storage, networking, and IAM. We're also looking at Kamal as a way to deploy web apps. This post is super interesting, so wanted to learn more.

K8s works with both local storage and networked storage. But the two are vastly different from an operations point of view.

With networked storage you get fully decoupled compute / storage which allows Kubernetes to reschedule pods arbitrarily across nodes. But the trade off is you have to run additional storage software, end up with more architectural complexity and get performance bottlenecked by your network.

Please check out our storage documentation for more details: https://reclaim-the-stack.com/docs/platform-components/persi...

> How does 'Reclaim the Stack' compare to Kamal?

Kamal doesn't really do much at all compared to RtS. RtS is more or less a feature complete Heroku alternative. It comes with monitoring / log aggregation / alerting etc. and also automates High Availability deployments of common databases.

Keep in mind 37signals has a dedicated devops team with 10+ engineers. We have 0 full time devops people. We would not be able to run our product using Kamal.

That said I think Kamal is a fine fit for e.g. running a Rails app using SQLite on a single server.

> Was migrating off of Heroku your primary motivation for building 'Reclaim the Stack'?

Yes.

Feel free to join the Discord and start a conversation if you want to bounce ideas for your k8s service :)