| Totally valid concerns. I don’t disagree that DIY hosting comes with real risks that managed platforms abstract away (but AWS could close your account too). We didn’t go into this blind, though: we spent a lot of time testing scenarios (including Hetzner/OVH support delays) and designing mitigation strategies. Some of what we do: • Our infra is spread across multiple providers (Hetzner, OVH) plus Cloudflare for traffic management. If Hetzner blackholes us, we can redirect within minutes.
• DB backups are encrypted and replicated nightly to various regions/providers (incl. one outside the primary vendors), with tested restore playbooks. The key point: no platform is free of counterparty risk, whether that’s AWS pulling a region for legal reasons or Hetzner taking a server offline. Our approach tries to make the blast radius smaller and the recovery faster, while also achieving compliance and cutting costs substantially (~90% as noted). DIY is definitely not for everyone. It is more work, but for our particular constraints (cost, sovereignty, compliance) we found it a net win. Happy to share more details if helpful! Oh, and imagine being kicked out of AWS when you used Aurora... My certified multi-cloud setup with standard components should not make you cringe. |
I probably won't be responding after this, or in the future on HN, because I took a significant hit to my karma for keeping it real and providing valuable feedback. There are a lot of people brigading accounts, punishing those who provide constructive criticism.
Generally speaking, AWS is incentivized to keep your account up as long as there is no legitimate reason to take it down. They generally vet claims with an appropriate level of due diligence before taking action, because keeping the account up means they can keep billing for that time. Spurious or unlawful requests cost them money; they want that money, and they are at a scale where they can afford that vetting.
I'm sure you've spent a lot of time and effort on your rollout. You sound competent, but what makes me cringe is that you are treating this as just a technical problem when it isn't.
If you've done your research, you will have run across more than a few incidents where people running production systems had Hetzner shut them down outright or worse, often in response to invalid legal claims that Hetzner failed to properly vet. There have also been some strange non-deterministic issues that may be related to failing hardware, but maybe not.
Their support often amounts to one response every 24 hours. What happens when the first couple of responses are boilerplate because the tech didn't read or understand what was written? That is 24 hours, plus some % chance of losing the next 24 hours at each step, and no phone support, which is entirely unmanageable. While I realize they do have a customer support line, for most it is an international call and the hours are banker's hours. If you're in Europe you'll have a much easier time lining up those calls, but anywhere else you are dealing with international calls where the first chance of the day is at midnight.
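To put rough numbers on that support arithmetic, here is a minimal sketch. It assumes each round trip takes a fixed 24 hours and that each reply is independently boilerplate with some probability; both the function names and the 50% figure in the usage note are illustrative assumptions on my part, not measured values for any provider.

```python
def expected_hours_to_useful_reply(p_boilerplate: float,
                                   hours_per_round: float = 24.0) -> float:
    """Expected wait until the first useful reply.

    Assumes every round costs `hours_per_round` and is wasted with
    probability `p_boilerplate`, independently of earlier rounds
    (a geometric distribution, so the mean is 1 / (1 - p) rounds).
    """
    if not 0.0 <= p_boilerplate < 1.0:
        raise ValueError("p_boilerplate must be in [0, 1)")
    return hours_per_round / (1.0 - p_boilerplate)


def prob_still_waiting(p_boilerplate: float, rounds: int) -> float:
    """Probability that the first `rounds` replies were all boilerplate."""
    if rounds < 0:
        raise ValueError("rounds must be non-negative")
    return p_boilerplate ** rounds
```

With a hypothetical 50% boilerplate rate, the expected wait doubles to 48 hours, and there is still a 1 in 8 chance of having nothing useful after three days. The tail, not the average, is what hurts during an outage.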
Having a separate platform for each server is sound practice, but what happens when the DAG running your logging/notification system is on the platform that fails and has no failover of its own? The issues are particularly difficult when half your stack fails on one provider, stale data is replicated over to your good side, and you have nonsensical or invisible failures. It's not enough to force an automatic failover with traffic management, which is often not granular enough.
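One way to picture the stale-replication problem is a gate on the healthy side that refuses to accept data from a peer unless more than the HTTP status looks right. This is only a sketch under my own assumptions: the `PeerStatus` fields, the heartbeat timestamp, the replication sequence number, and both thresholds are hypothetical and not taken from anyone's actual setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeerStatus:
    """Hypothetical snapshot of what the other provider's node reports."""
    http_ok: bool            # plain HTTP health check passed
    last_heartbeat: float    # unix seconds of the peer's last heartbeat write
    last_applied_seq: int    # highest replication sequence the peer has applied


def peer_is_trustworthy(status: PeerStatus,
                        now: float,
                        local_seq: int,
                        max_heartbeat_age: float = 60.0,
                        max_seq_lag: int = 100) -> bool:
    """Return True only if the peer looks alive AND reasonably current.

    An HTTP 200 alone is not enough: a half-failed node can keep answering
    while serving or replicating old data.
    """
    if not status.http_ok:
        return False
    if now - status.last_heartbeat > max_heartbeat_age:
        return False  # peer stopped writing heartbeats: treat as stale
    if local_seq - status.last_applied_seq > max_seq_lag:
        return False  # peer is too far behind to replicate from
    return True
```

The point of the design is that replication and failover decisions get their own checks instead of inheriting whatever the traffic manager thinks "healthy" means.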
It's been a while since I've had to work with Cloudflare tm, so this may have gotten better, but I'm reasonably skeptical. I've personally seen incidents where the RTO for support on direct outages was exceptional, but the RTO for anything beyond a simple HTTP 200 check was nonexistent, with finger-pointing that was pointless because the raw network captures showed the failure in L2/L3 traffic on the provider side, which the provider was ignoring. They still argued, and the outage was extended as a result. Vendor management issues are the worst when contracts don't properly scope and enforce timely action.
Quite a lot of the issues I've seen with various hosting providers, OVH and Hetzner included, are related to failing hardware or to transparent stopgaps they've put in place that break the upper service layers.
For example, at one point we were getting what appeared to be stale cache issues in traffic involving one node of a two-node backend set spread across different providers. There was no cache between them, and it was breaking sequential flows in the API while still fulfilling other flows that were atomic. HTTP 200 was fine; AAA was not, nor were a few others. It appeared that a Squid transparent proxy had been placed in-line, and it promptly disappeared once we reached out to the platform, without them ever confirming it had happened. That is concerning, to say the least, when the app you are deploying is knowledge management software holding proprietary and confidential information about the business. Needless to say, this project didn't move forward on any cloud platform after that (and it was populated with test data, so nothing was lost). It is why many of our cloud migrations were suspended and changed to cloud repatriation projects. Counter-party risk is untenable.
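A failure like that, where HTTP 200 is fine but sequential flows break, is the kind of thing a write-then-read canary can surface. Below is a minimal sketch, not what we actually ran: the `write` and `read` callables stand in for whatever your API client does, and the key name is made up.

```python
import uuid
from typing import Callable, Optional


def read_your_writes_probe(write: Callable[[str, str], None],
                           read: Callable[[str], Optional[str]],
                           key: str = "probe/canary") -> bool:
    """Write a fresh random value, read it straight back, and compare.

    Returns False if the read does not reflect the write, which is what
    you would see if something in the path serves stale responses even
    though every individual request succeeds.
    """
    nonce = uuid.uuid4().hex  # unique per probe, so old values never match
    write(key, nonce)
    return read(key) == nonce
```

Running this across the provider boundary on a schedule gives you a signal that a status-code health check never will, although it only tells you that something is wrong, not where.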
Younger professionals, I've found, view these and related issues solely as technical problems. They weigh those technical problems higher than the problems they can't weigh, because of a lack of experience and something called the streetlamp effect, which is an intelligence trap, often because they aren't taught a Bayesian approach. There's a SANS CTI presentation on this (https://www.youtube.com/watch?v=kNv2PlqmsAc).
The TL;DR is that a technical professional can see and interrogate just about every device, and that can lead to poor assumptions and an illusion of control, which tend to ignore or dismiss problems when there is no clear visibility into how those edge problems can occur (when the low-level facilities don't behave as they should). This is the class of problems in the non-deterministic failure domain, where only guess-and-check works.
The more seasoned tend to focus more on the flexibility needed to mitigate problems that arise from business process failures, such as when a cooperative environment becomes adversarial, which necessarily occurs when trust breaks down through loss, deception, or a breaking of expectations on one party's part. This phase change of environment, and its criteria, are rarely reflected or touched on in BC/DR plans, at least the ones that I've seen. The ones I've been responsible for drafting often include a gap analysis taking into account the dependencies, stakeholder thoughts, and criteria between the two proposed environments, along with contingencies.
This should include legal, obviously, to hold people to account when they fail in their obligations, but even that is often not enough today. Legal often costs more than simply taking the loss and walking away, absent a few specific circumstances.
This youthful tendency is what makes me cringe. The worst disasters I've seen were readily predictable to someone with knowledge of the underlying business mechanics, and how those business failures would lead to inevitable technical problems with few if any technical resolutions.
If you were co-locating on your own equipment with physical data center access, I'd have cut you a lot more slack, but it doesn't seem like you are, judging from your other responses.
There are ways to mitigate counter-party risk while still getting the hosting you need. Apples-to-oranges comparisons between services in an opaque landscape rarely paint an objective picture, which is why a healthy amount of skepticism and disagreement is needed to ensure you didn't miss something important.
There's an important difference between constructive criticism intended to reduce adverse cost and consequence, and criticisms that simply aren't based in reality.
The majority of people on HN these days don't seem capable of making that important distinction in aggregate. My relatively tame reply was downvoted by more than 10 people.
These people by their actions want you to fail by depriving you of feedback you can act on.