There's always a trade-off between the cost of failing and the cost of engineering it out. The problem is a lack of understanding of where and how apps and infrastructure fail, and how to avoid those failures. If you misunderstand the problem, you'll probably misjudge it.
People should absolutely be doing at least some back-of-the-envelope math on this before choosing a strategy.
If you're at N DAU, then a 12h downtime will affect a bit more than N/2 users, and some percentage of those users will become ex-users - you can run a small split test to figure out how many if you don't already have data on that. You'll also lose a direct half day of revenue. This type of thing will happen somewhere between once a year and once every couple of months, as low and high estimates.
Crunch those numbers, and you'll have an order of magnitude estimate of what downtime actually costs you, and what you can actually afford to spend to minimize it. Keep in mind that engineering and ops time costs quite a bit of money, and that you'll be slowing down other feature development by wasting time on HA.
For instance, let's say you're running a game with 1M DAU, and 5M total active users, making $10k per day (not sure if that's reasonable, but let's pretend), and you've figured out that 12h of downtime makes you lose approximately 10% of the users that log in during that period. In that case, 12h of downtime costs you a one-time "fee" of $5k, and also pushes away ~1% of your total users, which will cost you $100 per day as an ongoing "cost".
If we assume this happens exactly once, and that a mitigation strategy would work with 100% effectiveness, then you should be willing to spend up to $100 extra per day to implement that strategy; the $5k up-front loss is not nothing, but we can probably assume it'll get eaten up by engineering time to implement that strategy. If such a strategy would cost significantly more than $100 per day over your current costs, then by pursuing it you're assuming that "oh shit it's all gone to hell!" AWS events are likely to affect you multiple times over the period in question.
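To make the arithmetic above easy to rerun with your own numbers, here is a minimal sketch. The function name and parameters are made up for illustration, and it bakes in the same simplifications as the example: affected users scale linearly with outage length (exactly N/2 for 12h rather than "a bit more"), and revenue is spread evenly across all active users.

```python
def downtime_cost(dau, total_users, revenue_per_day, outage_hours, churn_rate):
    """Rough cost of a single outage (hypothetical helper, not a real model)."""
    day_fraction = outage_hours / 24

    # Simplification: users who hit the outage scale linearly with its length.
    affected_users = dau * day_fraction

    # churn_rate is the share of affected users who never come back;
    # this is the number you would measure or split test.
    lost_users = affected_users * churn_rate
    lost_user_share = lost_users / total_users

    # Revenue missed while the service was down.
    one_time_loss = revenue_per_day * day_fraction

    # Assumption: every active user contributes equally to daily revenue.
    ongoing_loss_per_day = revenue_per_day * lost_user_share

    return one_time_loss, lost_user_share, ongoing_loss_per_day


one_time, share, ongoing = downtime_cost(
    dau=1_000_000,
    total_users=5_000_000,
    revenue_per_day=10_000,
    outage_hours=12,
    churn_rate=0.10,
)
print(f"one-time loss: ${one_time:,.0f}")
print(f"share of users lost: {share:.1%}")
print(f"ongoing loss: ${ongoing:,.0f} per day")
```

With the game example's inputs this reproduces the $5k one-time loss, the ~1% of users lost, and the $100 per day ongoing cost. The ongoing figure is the rough ceiling on what a perfectly effective mitigation is worth per day if the outage happens once.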
I'm not saying these numbers are realistic in any way, or that the method I've shown is 100% sound (I'm on an iPhone, so I haven't edited or reread any of it); I'm just saying that whether you pursue a mitigation strategy or not, it's not terribly difficult to ground your decision in numbers. They do tend to be right on the edge of reasonable for a lot of people, so it's worth thinking about them (good) or (better) measuring them.
I agree with the first sentence: uptime is a very nice thing that users will notice and appreciate over time.
However, I strongly disagree with the second sentence. Investing in uptime is not always worth it. Taken to its logical extreme, imagine 2 potential websites. One of them is incredibly useful but only up 80% of the time. The other one is a blank HTML page, but it is the most reliable website in history, with 0 seconds of downtime in the past 10 years. If I surveyed users of both websites, I think it would be almost unanimous that people preferred the useful website that was up only some of the time.
Startups have limited time and resources, and in practice getting 99% uptime is relatively easy, whereas 99.9% uptime is relatively hard. That is a difference of ~7 hours of uptime per month. Yes, it sucks when your website is down, but it also sucks when there are features you can't develop because you don't have the time, or because your technical infrastructure doesn't allow them while you chase ultra-high reliability. Obviously this depends on your industry, e.g. if you are a payment processor you had better have super high uptime or you aren't going to have any customers, but realistically most companies will likely not lose that many customers if they are up >99% of the time.
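For anyone who wants to check the "~7 hours" figure, a quick sketch follows. It assumes an average month of 730 hours (365 * 24 / 12); the function name is just for illustration.

```python
HOURS_PER_MONTH = 365 * 24 / 12  # average month: 730 hours


def downtime_hours_per_month(uptime_percent):
    """Hours of downtime per average month allowed by an uptime percentage."""
    return HOURS_PER_MONTH * (1 - uptime_percent / 100)


for pct in (99.0, 99.9, 99.99):
    hours = downtime_hours_per_month(pct)
    print(f"{pct}% uptime allows about {hours:.2f} hours down per month")

# Gap between "relatively easy" and "relatively hard".
gap = downtime_hours_per_month(99.0) - downtime_hours_per_month(99.9)
print(f"99% vs 99.9%: about {gap:.2f} hours per month")
```

99% allows roughly 7.3 hours of downtime a month and 99.9% roughly 0.73 hours, so the gap is about 6.6 hours, which is where the ~7 hours comes from.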
There are also risks inherent in a more complicated system.
You can engineer a more complicated system with the goal of avoiding downtime, but the added complexity may introduce unexpected corner cases and cause a net decrease in uptime, at least in the short term.
It's often better to concentrate on improving mean time to repair (MTTR).
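One way to see why MTTR matters is the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR), where MTBF is mean time between failures. The sketch below uses made-up numbers (one failure every 30 days, 8 hours to recover) purely for illustration.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical baseline: a failure every 720 hours, 8 hours to repair.
baseline = availability(mtbf_hours=720, mttr_hours=8)

# Option A: fail half as often (often means more redundancy and complexity).
fewer_failures = availability(mtbf_hours=1440, mttr_hours=8)

# Option B: recover twice as fast (better monitoring, runbooks, rollbacks).
faster_repair = availability(mtbf_hours=720, mttr_hours=4)

print(f"baseline:       {baseline:.4%}")
print(f"fewer failures: {fewer_failures:.4%}")
print(f"faster repair:  {faster_repair:.4%}")
```

Under this formula, halving MTTR buys the same availability as doubling MTBF, and it can often be done without adding the extra moving parts mentioned above.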
That simply can't be true. There is always going to be a point where an extra decimal place of reliability is too costly.