| > Techincal I'm trying to share as much technical across this thread as for your two examples: System upgrades: Keep in mind that as per the ISO specification, system upgrades should be applied but in a controlled manner. This lends itself perfectly to the following case that is manually triggered. Since we take steps to make applications stateless, and Ansible scripts are immutable: We spin up a new machine with the latest packages and once ready it join the Cloudflare load balancer. The old machines are drained and deprovisioned. we spin up a new machine We have a playbook that iterates through our machines and does it per machine before proceeding. Since we have redundancy on components, this creates no downtime. The redundancy in the web application is easy to achieve using the load balancer in Cloudflare. For the Postgres database, it does require that we switch the read-only replica to become the main database. DB failover: The database is only written and read from by our web applications. We have a second VM on a different cloud that has a streaming replication of the Postgres database. It is a hot standby that can be promoted. You can use something like PG Bouncer or HAProxy to route traffic from your apps. But our web framework allows for changing the database at runtime. > Business Before migration (AWS): We had about 0.1 FTE on infra — most of the time went into deployment pipelines and occasional fine-tuning (the usual AWS dance).
After migration (Hetzner + OVHCloud + DIY stack): After stabilizing it is still 0.1 FTE (but I was 0.5 FTE for 3-4 months), but now it rests with one person. We didn’t hire a dedicated ops person.
On scaling — if we grew 5-10×:
* For stateless services, we’re confident we’d stay DIY — Hetzner + OVHCloud + automation scales beautifully.
* For stateful services, especially the Postgres database, I think we'd investigate servicing clients out of their own DBs in a multi-tenant setup, and if too cumbersome (we would need tenant-specific disaster recovery playbooks), we'd go back to a managed solution quickly. I can't speak for cloud FTE toll vs a series of VPS servers in the big boys league ($ million in monthly consumption) and in the tiny league but at our league it turns out that it is the same FTE requirement. Anyone want to see my scripts, hit me up at jk@datapult.dk. I'm not sure it'd be great security posture to hand it out on a public forum. |
Have you considered doing your own HA Load balance? If yes what tech options did you consider