Exactly. A lot of startups are convinced they'll need Google scale from day one. Then, of course, the overwhelming majority fail in the first year. Get a big, reliable, and cheap vhost/server somewhere, use as many "can't really go all that wrong" components as you can (Postgres, MinIO, etc.), and dockerize everything. If you want to get "fancy", use ZFS and set up some snapshots and backups. Most solutions don't even need 100% uptime. Communicate maintenance windows to customers and you'll be fine with a total of an hour of downtime (or whatever) in the first year. Most big, really complex, prematurely over-engineered and unnecessarily "optimized" solutions have enough footguns that you'll probably end up with more unscheduled downtime in the first year anyway. In the rare event the startup really succeeds and customer demand, load, uptime requirements, etc. call for it, you can throw revenue/funding/etc. at a K8s control plane on your favorite hosting provider, a managed Postgres/db/whatever, an S3-compatible object store, etc. Or, if things get really big, skip all of that and hire in-house talent to manage a couple of racks of leased hardware (still opex, like cloud, but almost always SUBSTANTIALLY cheaper) in geo-redundant/distributed colocation facilities. I've launched multiple startups with this strategy and it's gone very well. My current startups all run on the same big (but 10-year-old) hardware that has loooooong since paid for itself, even with lots of GPU, storage, etc. upgrades over the years. People can be kind of scared of hardware, but I've never had downtime or data loss caused by a hardware failure in almost 20 years of this approach. People are always amazed when I do things with ML, TBs of data, lots of bandwidth, etc. and I tell them my total hosting costs are $150/mo.
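For the "ZFS and some snapshots" part, here is a minimal Python sketch of what a nightly rotation could look like. It is only an illustration, not the setup described above: the dataset name `tank/data`, the `auto-` prefix, and the 14-day retention are all made-up assumptions, and it presumes the standard `zfs` CLI is installed and the script runs with enough privileges. Snapshots on the same box are not a backup on their own, so you'd still want to ship them somewhere else.

```python
from datetime import datetime, timedelta, timezone
import subprocess

PREFIX = "auto-"          # hypothetical prefix marking script-made snapshots
STAMP = "%Y%m%d-%H%M%S"   # timestamp format embedded in the snapshot name


def snapshot_name(dataset, now):
    """Build a name like tank/data@auto-20240315-120000."""
    return f"{dataset}@{PREFIX}{now.strftime(STAMP)}"


def expired(snapshots, now, keep_days):
    """Return the auto-snapshots older than keep_days; skip everything else."""
    cutoff = now - timedelta(days=keep_days)
    old = []
    for snap in snapshots:
        _, _, label = snap.partition("@")
        if not label.startswith(PREFIX):
            continue  # leave manually created snapshots alone
        try:
            taken = datetime.strptime(label[len(PREFIX):], STAMP)
        except ValueError:
            continue  # name doesn't match our scheme, don't touch it
        if taken.replace(tzinfo=timezone.utc) < cutoff:
            old.append(snap)
    return old


def rotate(dataset, keep_days=14):
    """Take a new snapshot, then destroy expired ones (shells out to zfs)."""
    now = datetime.now(timezone.utc)
    subprocess.run(["zfs", "snapshot", snapshot_name(dataset, now)], check=True)
    listing = subprocess.run(
        ["zfs", "list", "-H", "-d", "1", "-t", "snapshot", "-o", "name", dataset],
        check=True, capture_output=True, text=True,
    ).stdout.split()
    for snap in expired(listing, now, keep_days):
        subprocess.run(["zfs", "destroy", snap], check=True)
```

Calling `rotate("tank/data")` from a daily cron job or systemd timer would be the obvious way to use it. The pruning logic is kept as a pure function so it can be checked without touching a real pool.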
I'd love to hear more about the setup. I suspect something as simple as a disk failure would cause an outage, although I suppose you can detect a soon-to-fail disk via SMART and resolve that with scheduled maintenance/downtime. But what about a power supply failure? Do you keep redundant backup parts on hand?
There's definitely something nice about having a hardware error on a cloud VM result in that VM cycling out to new hardware automatically. In contrast, something as simple as buying a new off-the-shelf PSU feels like a ~1 hour downtime event (longer if you don't have purchase card authority, it's nighttime, you need to order online, etc.).
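To make the "detect a soon-to-fail disk via SMART" idea concrete, here is a rough Python sketch. It assumes smartmontools 7.0+ (for `smartctl --json`) and an ATA/SATA drive. The JSON field names (`smart_status.passed`, `ata_smart_attributes.table`) are what I'd expect from that output and worth verifying on your own hardware, since NVMe drives report differently. The watched attribute IDs are a common rule of thumb, not an authoritative list.

```python
import json
import subprocess

# Attributes often treated as early-failure signals when their raw value is > 0.
# This selection is an assumption, not a vendor-endorsed threshold.
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}


def disk_warnings(report):
    """Return human-readable warnings from a parsed smartctl JSON report."""
    warnings = []
    if not report.get("smart_status", {}).get("passed", True):
        warnings.append("overall SMART health check failed")
    for attr in report.get("ata_smart_attributes", {}).get("table", []):
        raw = attr.get("raw", {}).get("value", 0)
        if attr.get("id") in WATCHED and raw > 0:
            warnings.append(f"{attr.get('name', attr['id'])} raw value is {raw}")
    return warnings


def check_device(device):
    """Run smartctl on a device such as /dev/sda and return any warnings."""
    # smartctl sets non-zero exit-status bits for some warnings, so the
    # return code is deliberately not treated as a hard failure here.
    out = subprocess.run(
        ["smartctl", "--json", "-H", "-A", device],
        capture_output=True, text=True,
    ).stdout
    return disk_warnings(json.loads(out))
```

Run `check_device("/dev/sda")` (as root) from a daily timer and alert on any non-empty result, which would give you the lead time to swap the disk in a scheduled window. It does nothing for the PSU question, of course.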