by Laaas 657 days ago
How are they managing to spend 7.5k a month in server expenses? Are they using AWS?

We're paying ~$1k/month for all our webservers (which is a dozen ECS instances). They're handling about 3,000 requests per second (but that does sometimes massively spike to tens of thousands if not more).

We're paying ~$1k/month for all our tooling servers (so the 150 different test runners, representers, analyzers that are used to check people's code. There's >=1 of those running every second). Bear in mind, we're running students' code in over 70 languages here. Each is a Docker container (often many GB large) - so we pay for HDD too.

The biggest actual cost is the database at $2k/month. We have about 600 queries per second, and around 10MB per second (spiking to 47MB per second) of read throughput. It's an autoscaling database, but AWS determines that it's at the level it needs to be, and if I turn that down, performance suffers (I've tried).

Beyond that, all the other individual services are ~$300/m, so quite small amounts, but for things we rely on (e.g. caching servers, a shared filesystem amongst all those servers, and other things).

$1.2k on tax is also fun.

Thank you so much for sharing your costs! I was very curious to see what the real expenses for AWS services look like, as I've always assumed AWS is massively overpriced.

From the numbers you've provided, it seems like your total cost is around $5.5k, so I assume the remaining $2k is attributed to traffic.
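For anyone checking that arithmetic, here is a minimal sketch that adds up the figures quoted above. The "other services" line is ambiguous in the original comment, so a single ~$300 entry is assumed; the dictionary keys and the `unexplained` helper are made up for illustration.

```python
# Monthly figures quoted in the thread, in USD.
# Assumption: "other individual services are ~$300/m" is counted once.
quoted_costs = {
    "webservers (ECS)": 1000,
    "tooling servers": 1000,
    "database": 2000,
    "other services (assumed single ~$300 line)": 300,
    "tax": 1200,
}

TOTAL_BILL = 7500  # the ~7.5k/month from the top-level question


def unexplained(costs, bill):
    """Return the part of the bill not covered by the itemised costs."""
    return bill - sum(costs.values())


itemised = sum(quoted_costs.values())
print(f"itemised: ${itemised}, unexplained: ${unexplained(quoted_costs, TOTAL_BILL)}")
```

Under that assumption the itemised costs come to $5.5k and about $2k is left over. Whether that remainder is really traffic is this commenter's guess, not something the breakdown confirms.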

Everything looks quite reasonable, except for the database and traffic costs. I've run MySQL servers handling 150k reads/sec and 50k updates/sec with no issues, even on very cheap machines (around €30 per month). Years ago, we were serving over 100 million pages (of heavy content) per month, and we didn’t even bother looking at traffic statistics because, here in Germany, it’s hard to hit the traffic limits that most hosting providers impose.

That being said, AWS is less expensive than I initially thought. At the same time, I’m confident you could reduce your hosting bill by up to $2k without even leaving AWS by setting up your own database server. Moving away from AWS entirely might be challenging, as managing a fleet of about 30 servers would likely take one or two days of work per week (I'm managing a dozen mostly idling servers and I work one day per month on them). When your hosting bill reaches $30k, I'm very sure it would be cheaper to hire someone (hint, hint ;-)) who moves everything to dedicated servers and manages them.

I was curious, so I just checked on one of our customers (I work at a small MSP in the UK) by way of comparison and we have one chugging along happily at more than double those numbers on a $288 Linode dedicated CPU instance. And we're only on that size for ease of disk space handling as the database is several hundred GB. CPU is basically at zero, it's the disk IO that actually gets you on some of these busier databases (from my experience).

RDS is extremely expensive. All managed databases are.

That said, it's a trade-off of convenience and being in the AWS bubble, and weighing up the pros/cons of separating out services. Data Transfer is another thing to consider too of course. Sticking your database elsewhere might cost more in egress traffic communicating with it from your other AWS infrastructure. If you're all in on other AWS services, sometimes the RDS price is just worth it when it comes to the total price. Sounds like this might be the case for your setup.

I hope you do manage to work things out. The service you have is great.

PS - Side note on RDS sizing. You might already know, but sometimes it's worth increasing the storage size on gp3 type storage above 400GB (if you haven't already), as you get a 12,000 IOPS baseline alongside 500MiB/s throughput[0] when you have that much storage. That's 4 times the baseline performance you get below 400GB, but you only pay for the additional storage cost. It can make a difference if you're IOPS constrained or trying to deal with bursty traffic but want to use the smallest instance size possible otherwise to save costs.

[0]https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_...
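A rough sketch of the sizing rule described in that PS, using only the numbers in the comment: 12,000 IOPS and 500MiB/s at the larger size, and "4 times" less below it. The function name is made up, and treating exactly 400GB as the larger tier is an assumption; check the linked AWS page for your engine before relying on it.

```python
THRESHOLD_GB = 400                      # size quoted in the comment
LARGE_TIER = {"iops": 12_000, "throughput_mib_s": 500}
RATIO = 4                               # "4 times the baseline" per the comment


def gp3_baseline(storage_gb):
    """Baseline performance implied by the comment for a given volume size.

    Assumption: the boundary is inclusive (>= 400GB gets the larger tier).
    """
    if storage_gb >= THRESHOLD_GB:
        return dict(LARGE_TIER)
    return {name: value // RATIO for name, value in LARGE_TIER.items()}


for size in (200, 400):
    print(size, "GB ->", gp3_baseline(size))
```

The small tier here (3,000 IOPS, 125MiB/s) is simply the large tier divided by four, as the comment states, rather than a figure read from the documentation.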

A part of me is tempted to say 'maybe some cost reduction in cloud bill is possible' but for the scale you operate at and cost you already have, I feel like refactoring the revenue model is the greater strategic 'bang for your buck'
I don't want to armchair ops your decisions, but maybe do consider moving to some dedicated Hetzner servers, at least for the runners. You can probably reduce that cost tenfold relatively simply.
Move everything to Hetzner (or the like) at double the capacity that AWS has on average and you still spend about 10% of what you spend on AWS.

That is how stupid costly AWS is and 99% of people using AWS are wasting money like there is no tomorrow.

Problem is, Hetzner has such stringent user verification that it's tough to get hosting on there.

You don't even need to use Hetzner, I'm actually quite surprised they are spending 2k on DB. For the env isolation I understand they need EC2 but...

it shouldn't cost anywhere near that amount, for instance I have 3 million users on a $40/month DigitalOcean VPS with just Django + Postgres and I have far more reads and writes.

for env isolation you could spin up a $5/month VPS and shut it down when idle

Sure, Hetzner is just one example. Many providers sell at about 5 to 10% of AWS prices.
Eh? It's not "tough" to get hosting there... It's super easy if you use legit details and live in a country they are allowed to do business with.
> 99% of people using AWS are wasting money like there is no tomorrow

Why do you think that is? Are that many people that dumb or are there other reasons they use managed cloud?

AWS will charge for network traffic, both in and out. Would need to calculate what is cheaper.
The runners take some code and return text, right? That shouldn't be too much traffic, hopefully.
Not to be that guy, but I'm pretty sure that these volumes (talking about ECS and database here) could be handled quite well by a few dedicated $100/mo servers, provided the code isn't hugely unoptimized. So I'm sure you could save (very conservatively) half your budget when going on-prem. That said, it probably won't help you much overall, I would imagine.
I've mentioned this before on HN, but we had a single postgres machine primary read/write server that did 4k QPS 24/7. It had DDR ram as storage on a PCIe card, of course, but this was before "SSD" was a thing. It was for a site that hosted portfolios of images, for both people in the images and people who took the images, and such. The front end data (the images and text) was, iirc, 3TB. Sometimes we'd need a server in a new location, so a locked metal briefcase was carried from the DC where the front-end data lived, to our offices, where one of our "IT" people would then carry it on to the new location and offload it to those servers in that location.

Anyhow that database server was probably ~$35,000 all in. That's 5 months of your current AWS spend. One of the things i did during that time was take a 2 generation newer server, a $35,000 1u Dell with 512GB of ram, mirror the postgres database into tmpfs and enable replication; then we set that machine as primary. The new machine didn't break a sweat. So much so that one of the things me and the (really very awesome and nice; Hi, Chuck, if you're out there!) DBA did was set postgres to use no more than 640KB of memory, then ran the entire site, with 4k QPS, on that postgres instance with 640KB of memory (not counting the 280GB of tmpfs storage, of course!), just to prove it would work. It did - although some of the bookkeeping queries (not sure what they're called) were taking a very long time, and would have had to be refactored to use less temporary memory, and such.

anyhow my point is, there are people out there that can do things cheaper, or faster, or more efficiently than whatever you got goin on right now. Your statistics on "per second" usage and the like don't sound too demanding. If you could squirrel away $500/month for a few months, and you ask around for someone that can rack metal and has peering, there are people (including me) who could get you co-lo in <16U[0] with redundancy, where your only monthly infra charges would be the co-lo fees.

[0] Old, extremely beefy, but large servers are generally 4U, but dirt cheap for what you get. Ex: 80 thread, 512GB RAM, 8 SAS bay HP server, $800 shipped. And i bought those 6 years ago. However: 5950x, 128GB RAM, 24 SATA port can be had for <$2000 (i'm guessing based on what i paid a few years ago), and that's roughly equivalent in power (kernel compile takes 3 seconds longer on the 5950x but it uses 1/4th the power at the wall). The reason i tagged on <16U is because at most you're gunna need 4x4U, two "front end" and two "back end" machines, with duties split and everything redundant in the rack. I haven't looked in a while to see what's available on ebay as far as more density, but for sure 16U or less!

The issue becomes: how do you find someone who knows how to do all that, that is willing to work for next to nothing because they believe in the ngo/nfp? Maybe there's a tech forum that people like that read, who knows.

good luck, and thank you for doing things to help other people. I hope it all works out in the end.

email in profile.

RAM as storage for DB? So data loss on reboot? Sounds like very specific use case.
https://en.wikipedia.org/wiki/Fusion-io

maybe i misspoke - at one point there was battery backed DDR ram they used in PCIe, but by the time i came around they were using Fusion IO PCIe devices, which i guess were NAND flash, not DDR. or, alternatively, that is how it was explained during onboarding - "it's like DDR on a PCIe card, so the iops are 1000x that of SAS 10k drives"

unless you're talking about our experiment of tmpfs - then yeah, the use case was "genewitch heard bill gates say 640k should be enough for anyone; here's a super beefy machine to test that theory; theory tested." We didn't run the site live on that machine for more than 10 minutes or so, we switched it back to the fusion-io backed server immediately. It was a proof of concept about one of the things we could do with these new servers - read replicas with the DB in tmpfs for extreme speed and no IO blocking.

It sounds like a lot, but even if they dropped server expenses to zero they'd only have $90k/year to spend.

That is not much, considering they need to cover all aspects of running a nonprofit; they need enough technical staff to maintain a 24/7 on call rota while letting people occasionally go on holiday; and they need to pay healthcare and pensions and suchlike on top of salaries.

And it's not like they can offer a meagre salary now but promise vast wealth in the future, like a startup might.

That is easy to turn around: if the company did smarter and cheaper hosting, we would ask why they didn't spend more time on finding a viable business model.
Cloud providers have gotten really good at selling scale anxiety. Very easy to reach $10k/mo when building on a modern web stack using managed services. Kafka alone could get you there once you factor in multiple environments and add-ons.
That's only around $10/hour, which doesn't actually buy one a whole lot of servers/databases/bandwidth/monitoring/logs/etc.
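The "$10/hour" and "$90k/year" figures used in this subthread both follow from the ~$7.5k/month in the original question. A quick sketch of the conversion, assuming a 365-day year (the helper names are made up):

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def annual_usd(monthly_usd):
    """Yearly spend for a fixed monthly bill."""
    return monthly_usd * 12


def hourly_usd(monthly_usd):
    """Average spend per wall-clock hour for a fixed monthly bill."""
    return annual_usd(monthly_usd) / HOURS_PER_YEAR


monthly = 7500
print(f"${annual_usd(monthly):,}/year, ${hourly_usd(monthly):.2f}/hour")
```

So the hourly figure is slightly above $10, which matches the rounding used above.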
The reliability/uptime guarantees of the cloud providers are dubious, but in this case I don't think they even need to be discussed: this product makes no profit. No money is going to be lost if the thing goes down, because it already doesn't make profit. In fact, just keeping the thing up is making them lose money, so short of completely shutting it down, moving it to more cost-effective hosting would at least mean they can keep it going for longer on their donations.
>> which doesn't actually buy one a whole lot of servers/databases/bandwidth/monitoring/logs/etc.

It buys you 150 machines for a month, each a 12 core 24GB VPS with unlimited traffic on a 1Gbps link.

See my comment below.

And then you get to stand up your own databases, load balancers, monitoring, logging, etc. For which you need a development team with significant operations experience to do correctly - who will surely cost you more than $90k/year

I get it, AWS looks expensive, but a bunch of their foundational services are real force-multipliers if you don't have the cash to build out entire operational teams.

>> if you don't have the cash to build out entire operational teams

It's just not true that AWS doesn't need expensive experts to get stuff done - it really does.

Anyone who is half decent on the command line in Linux can get all those servers installed and running without the spaghetti complexity of AWS.

The cloud as a magical place of simplicity and ease of use and infinite scalability in every direction - I think it's the opposite of that - AWS is a nightmarish tangle of complexity and hard to configure, understand, relate and maintain systems.

It's MUCH easier just to load up a single powerful machine with everything you need. I'm not saying that works for all workloads but a single machine or a few machines can take you an awful long way.

> It's MUCH easier just to load up a single powerful machine with everything you need. I'm not saying that works for all workloads but a single machine or a few machines can take you an awful long way.

For the core service I tend to favour monoliths too, but I would say you are vastly underestimating the halo of other crap needed to operationalise a real website/SaaS.

Where is your load balancer? Your database redundancy? Where are backups stored? Where are you streaming your logs for long-term retention? Where are you handling metrics/alarming?

Bare metal is great, but you have to build a ton of shit to actually ship product.

> Where is your load balancer? Your database redundancy? Where are backups stored? Where are you streaming your logs for long-term retention? Where are you handling metrics/alarming?

What we lose sight of is that those things aren't as important as we, as SREs, would like to think. When you're a corporation of one person, trying to stay afloat, you can just rely on a single big box and spend your time dealing with all the other problems first. Make sure you have an escape hatch so you can scale up if need be, but don't overengineer for a problem you won't run into.

> Your database redundancy? Where are backups stored? Where are you streaming your logs for long-term retention? Where are you handling metrics/alarming?

Who cares? At this point it's a hobby project that makes zero profit and is bleeding money. No more money is going to be lost if they lose the DB tomorrow. No more money is going to be lost if they go down for an hour or a day or a week (in fact, they might _save_ money if they don't get more AWS charges during the outage).

They have nothing to lose, and about 6k/month to gain by moving to cost-effective hosting, which could actually make this a decent side-project.

Have managed all that in the past with bare metal and more (you forgot configuring routers, installing OS, managing upgrades, dealing with hardware swaps, etc). It's soooo much more sane to deal with that than AWS configuration, actually relatively easy for someone half competent. Luckily we can just employ people to mess around full time with AWS.
> And then you get to stand up your own databases, load balancers, monitoring, logging, etc.

You get most if not all of this on DigitalOcean, Linode, UpCloud, Scaleway, etc., all for a LOT cheaper.

> but a bunch of their foundational services are real force-multipliers if you don't have the cash to build out entire operational teams.

No, it's not. As above, and for a lot of things AWS' complexity and silly factor can make it even worse. In GCP I can set up a dual region bucket. As simple as that. In AWS I need to set up 2 buckets, a replication role, bucket policies, lifecycle policies and a lot more just to get the same. Force multiplier? As in make it slower? EKS takes longer than the default Terraform timeout to provision. The list goes on...

I think a lot of folks make their lives unnecessarily complicated by trying to do things on AWS in an explicitly non-AWS way (I inherited a startup codebase last year that did this to themselves in spades).

Why go to the trouble of running Kubernetes on top of AWS, when ECS does roughly the same job at a fraction of the complexity?

Why use Terraform when CloudFormation maps better to the underlying primitives?

> Why go to the trouble of running Kubernetes on top of AWS, when ECS does roughly the same job at a fraction of the complexity?

Because it doesn't. ECS has its own complexities - perhaps as a result of EC2 fleets / autoscaling groups and more. Suddenly you need launch templates and it goes on. Have you tried updating ECS via CLI? It's largely confusing.

> Why use Terraform when CloudFormation maps better to the underlying primitives?

If only. It has gotten better, but historically CloudFormation has had a lot of missing features and still does. CloudFormation can get stuck for hours and you'd just have to wait. The cross region support is terrible. Not to say Terraform doesn't have its quirks, but it's definitely not "worse".

Vendor lock-in. Knowledge transfer.

And I disagree that AWS is less complex. Managing services across AWS is complex, K8s is as well but I would rather manage K8S on bare instances.

Less vendor lock-in.

My devex team has developed helm charts we can use that automatically detect EKS, AKS or gcp K8s and configure the parts of an app to work with each environment, but the end users of the helm chart don’t really have to care.

> Are they using AWS?

Yes!

You’d think they’d use a cheaper hosting service if that’s literally the only thing preventing them from having positive gross profit.
> if that’s literally the only thing preventing them from having positive gross profit

It’s clearly not if the founder isn’t taking a salary and they just had to lay off their only employee..

honest question: is there any ECS-like hosting that's much cheaper?
Docker Swarm. After building a whole CD PaaS for ECS, I came to believe that swarm is a much more reasonable place for teams to start, so I built a free tool for deploying single-machine swarms called Rove. Do your own research though, and remember you don't have to use one tool or platform for everything.
> is there any ECS-like hosting that's much cheaper?

Most have adopted EKS-like services i.e. kubernetes.

There is fly.io that's closer. Hope they improve on the reliability aspect.

for compute of containerized payloads, in-house servers are a no-brainer for cost.

almost zero sysadmin troubles. might even reduce the troubles of working with eks/ecs.

now for storage and db, that's a different story.

$50/month gets you a 12 core 24GB VPS with unlimited traffic on a 1Gbps link on Ionos.

https://www.ionos.com/servers/vps

$7500/month would get them 150 such servers.

Maybe they should cut the AWS costs and hire their developer back.

I'd be really interested to hear the breakdown of their AWS bill. It would be a crime if they were blowing what money they have giving Amazon 9 cents per gigabyte egress.
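To put the 9 cents per gigabyte in perspective, here is a small sketch of how much egress a given spend would correspond to. The $2k traffic figure is another commenter's guess from earlier in the thread, not a confirmed line item, and the flat rate ignores any tiering or free allowance.

```python
EGRESS_USD_PER_GB = 0.09          # rate quoted in the comment above
ASSUMED_TRAFFIC_SPEND_USD = 2000  # a guess made elsewhere in the thread


def egress_gb(spend_usd, usd_per_gb=EGRESS_USD_PER_GB):
    """Gigabytes of egress a spend would cover at a flat per-GB rate."""
    return spend_usd / usd_per_gb


gb = egress_gb(ASSUMED_TRAFFIC_SPEND_USD)
print(f"{gb:,.0f} GB, roughly {gb / 1000:.1f} TB per month")
```

If the guess were right, that would be on the order of 22 TB of outbound traffic a month.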

$7500/mo is less than half of one headcount --- if you could get all hosting for free. This is a sideshow. With these numbers, their viability is not determined by hosting costs.

The problem here is that we all have opinions about hosting, but not so many useful opinions about business models, so hosting feedback is what this person is going to get.

$7500/mo easily hires two great 10 yr experience developers in most parts of the world.
I don't think this has much to do with what I said, and is rather a grievance comment about how much cheaper programmers are abroad from the US. But, if you really believe this, you should be outcompeting a lot of US tech companies with this strat.
Maybe I'm out of touch with how things have changed over the last year or 2, but 2 years ago you'd struggle to find 1 great dev with 10 YoE for that price ($90K/yr)

Even if things have changed now, I can't imagine you'd find a great dev with 10 YoE for less than $80K/yr, and that's hiring globally (with the time zone issues that come with it). You can probably get 1 OK dev and 1 bad dev with 10 YoE for that price though, but you'd usually be better off just hiring one great dev than any other combination.

It's not $90k/yr, it's something substantially less, because the fully loaded cost of an employee is much higher than their nominal salary.
You are out of touch with most parts of the world. Average income worldwide is about $10K a year. So 90K is about 9 times average income and in many parts of the world hires the top 10% of developers of that country.

But it will probably hire less than half of a US developer that then pisses all the other money away on overpriced AWS servers, so that is right.

If they are self-managing all the extras you get with a decent cloud setup (backups, node failover, load distribution and auto-scaling, multi-region or at least multi-DC for availability beyond single node failures, …), they are going to need an infrastructure person as well as that developer. Preferably two so the one isn't effectively on-call 24/7. And for that multi-DC for availability thing: you might need someone (or assign time from existing people) to manage the accounts with your various providers, you won't want tens+ of VPSs from just one provider like that. Oh, and on backups & failover, you need person-time (and other resources, but the people are probably the expensive part from the business PoV) to regularly test and adjust all of that, so you can be reasonably sure it all works when actually needed. And you need to manage replacing those people when/if they decide to move on to something new, etc…

Also note that a lot of the things you are paying for (CPU cores, traffic, network throughput) in those nodes are shared resources (that Gbit link especially) and/or have “fair use” policies attached to them, and while the same might be true of cloud providers those policies are often either more generous or (perhaps more important from the business stability PoV) at least better defined.

“Cloud” is still expensive compared to buying and managing individual nodes, even if you add in all the above and the things I no doubt forgot to mention, but it does give a lot more for the money than this sort of comparison against individual nodes suggests. And sometimes just not having to deal with all that, keeping the business more focused on its core competencies, is worth the extra expense.

In DayJob we use Azure a lot, and sometimes I see the costs of certain things¹² and balk, and we do still have infrastructure people to manage the platform, but overall it works better for us than managing our own resources more directly. We have an extra complication due to our client base (regulated companies like banks and insurers, who are storing PII of both their own people and their customers with us) in that we have to give a lot of assurances on security and such which would be more work (it is already a _lot_ of work as anyone else in that sort of B2B arena can attest) if we self-managed everything.

----

[1] $2,400/yr for SFTP access to a storage account if you need it available 24/7?! Especially given we have at least one such account per client as their requirements understandably require that level of separation. I think we'll keep using the relay & management dashboard I set up in a few cheap VMs, thanks…

[2] and the performance given the costs: AzureSQL³ I'm looking at you!

[3] though again, some of that cost is in things like the scaling flexibility and other infrastructure convenience, which the business finds worth paying for

> If they are self-managing all the extras you get with a decent cloud setup (node failover, load distribution and auto-scaling, multi-region or at least multi-DC for availability beyond single node failures, …)

History has proven that most of the time these reduce availability rather than increase it. Any sort of failover, and the complicated setup needed to get it going, introduces more bugs and issues than the redundancy it provides makes up for.

Have we forgotten the number of large single server applications running on single linux machines that never needed an unplanned restart or had a crash for years? And you can't beat AWS us-east-1 or Azure or GCP in outages lately.

And I doubt any service like this needs auto-scaling. Most services barely will use up a proper single server i.e. something with >96 cores >1TB of RAM.

> “Cloud” is still expensive compared to buying and managing individual nodes > And sometimes just not having to deal with all that, keeping the business more focused on its core competencies, is worth the extra expense.

There are ways to not manage all that and still be in the cloud. It's called don't use AWS or Azure.

> they are going to need an infrastructure person

No.

I run multi-site Ceph+Nomad clusters with NixOS on Hetzner for our startup and maintaining those takes less than 5% of my time.

By using great tools and understanding them well you can do it with little manpower. I learned all those tools in around 3 months total -- so around as much as getting a basic understanding of AWS IAM ;-)

The only thing you don't get with that from your list is auto-scaling. But with Hetzner the price difference vs AWS is 10x for storage, 20x for compute, and 10000x for traffic, so we just over-provision a little. And my 5% time /includes/ manual upscaling.

Yes, I am on call 24/7 to manage that infra, but I'd be as well when using hosted cloud services. Yes, fixing a Ceph issue, or handling Hashicorp Consul not handling an out-of-disk situation correctly is more complicated than waiting for S3 to come back from its outage, but the savings are massive. Testing whether your backup restore works is something you need to do equally with hosted services.

So it is definitely possible to self-manage everything, for 5% of one engineer.

> By using great tools and understanding them well you can do it with little manpower.

“and understanding them well” is doing a lot of legwork there. From a standing start how does a startup that has the skills & experience to make the product but not necessarily manage the infrastructure get to the point of understanding the tools well, or even knowing which tools are best to learn to the point of understanding well?

> So it is definitely possible to self-manage everything, for 5% of one engineer.

I can accept that as true, if you have the right person/people, and they are willing (particularly the on-call part).

I'm in a similar situation; what resources did you find helpful for learning NixOS? Tho I could skip that for now and stick containerized, in which case I just need Nomad..but I'm not certain on picking it over K8s in any case. Just knowing I'm gonna have to deal with this soon and you seem to have it figured out enough!
I found NixOps when searching for an alternative to Ansible that is actually declarative and not just a "bash in yaml" runner. Our Ansible deployments took > 10 minutes and were not "congruent" (well explained in [1]): Removing the Ansible line that installed nginx did not uninstall nginx, so the state on all servers diverged over time and we had no clue what was running where. Docker was also very slow because changing something early in a Dockerfile leads to lots of re-building, because again it's just bash scripts with snapshotting.

I thought "surely somebody must have invented a better system for this" and NixOps was exactly that. Deploying config changes always took a few seconds with that, instead of 10 minutes.

> what resources did you find helpful for learning NixOS?

This was already in 2017 so documentation was worse than it is today.

On a flight I read the Nix, NixOS, Nixpkgs manuals top to bottom. I also read some of the nix-pills, but didn't like that they went so deep into the weeds of packaging when my primary interest at the time was OS configuration management. In retrospect, I should have read those also front to end to save some time later when packaging our own software and some specific dependencies became more important for us. I also read various blog posts, examples, and asked some questions in the IRC channel (now Matrix), where there were some people that simply knew every detail and were willing to spend hours sharing their knowledge (thanks cleverca22!).

I also read key NixOS logic source code, such as the `switch-to-configuration` script that switches between 2 declarative configs (like many, I do not like that this is written in Perl, and I'm sure it will eventually be switched).

A thing I did wrong was to learn too late how to write my own NixOS modules; I wrote our own systems as "plain nix functions" but they would have been better as NixOS modules, because those allow overriding parts of the config from outside, and make code more composable (see also https://news.ycombinator.com/item?id=41355203).

I spent 2 months prototyping all our infra in NixOps and learned by doing.

I also learned specifically where the gaps are: NixOS generally handles what's running on a single machine (with systemd units), and with e.g. NixOps you can access the global config of other machines (to render e.g. a Wireguard config file where you need to put in all machines to connect to, so {all machines IPs} \ {own IP}). It does not handle active cross-machine coordination, e.g. if some GlusterFS or Ceph tutorial says "first run this command on this machine, then afterwards that command on that other machine", or "run this command on any machine, but only run it once". So I learned Consul as a distributed lock service to coordinate (mutex) commands across machines. Luckily, the amount of software that needs "installation by human operator running commands" is continuously going down, declarative config becomes more of a norm.
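The "{all machines IPs} \ {own IP}" idea is just a set difference computed per machine from the global config. A minimal sketch in Python rather than Nix, with hypothetical addresses and helper names; a real WireGuard config would also need keys and endpoints for each peer.

```python
import ipaddress

# Hypothetical cluster addresses, for illustration only.
ALL_MACHINE_IPS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]


def peers_for(own_ip, all_ips):
    """Return every machine IP except our own, sorted numerically."""
    others = set(all_ips) - {own_ip}
    return sorted(others, key=ipaddress.ip_address)


def peer_table(all_ips):
    """Map each machine to the list of peers it should connect to."""
    return {ip: peers_for(ip, all_ips) for ip in all_ips}


for machine, peers in peer_table(ALL_MACHINE_IPS).items():
    print(machine, "->", peers)
```

This covers only the static, render-a-file part. The "run this command once, on any machine" part is the coordination problem the comment says needed a distributed lock such as Consul.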

With NixOS, a good thing is that while it is reasonably complex, it is simple enough that you can understand it fully, that is, for any given behaviour you _know_ where in the nixpkgs code it is. I recommend to use that approach (spend a few months to understand it fully), because it makes you massively more productive.

I also believe that this is a big benefit of NixOS vs e.g. containers on Kubernetes: Kubernetes is big and complicated, with likely more lines of code than anybody could read, and the mechanisms are more involved (for example, you need to know a lot of iptables to know how a request is routed eventually to your application code). NixOS is simpler (packaging software and rendering systemd units); it uses a more radically different foundation but in turn advanced features on top of it are straightforward (multiple versions of libraries on the same machine, knowing for every binary exactly which source code built it, running _only_ what's declared, automatic transparent build caching, spawning VMs that mimic your physical servers). NixOS provides less than cluster orchestrators like Nomad and Kubernetes (e.g. no multi-machine rolling deploys with automatic rollbacks), but one person can keep it all in their head, and it is very good at building things that run in cluster orchestrators. (Disclosure: I know much more about NixOS than Kubernetes; maybe Kubernetes experts disagree with me and think that a single person can understand Kubernetes source entirely to get the fast directed debugging I claim is possible with NixOS.)

Often, you also don't need a cluster orchestrator. Our Ceph runs straight on NixOS on Hetzner dedicated machines, it does not run in our Nomad. We use Nomad to schedule our application-specific jobs onto our machines -- that is, we use the cluster orchestrator for its original design goal (bin-packing CPU + memory jobs across machines), and do not use the cluster orchestrator as a "code packaging and deployment tool", which is what much of current Docker+Kubernetes is used for. We find that Nix is simpler and better for the latter.
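For readers unfamiliar with the term, packing CPU + memory jobs onto machines can be illustrated with a first-fit-decreasing heuristic. This is a toy sketch, not how Nomad actually schedules; the machine sizes, job names and the `first_fit` helper are all made up.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    name: str
    cpu: int       # free cores
    mem_gb: int    # free memory
    jobs: list = field(default_factory=list)


def first_fit(jobs, machines):
    """Place (name, cpu, mem_gb) jobs, largest first, on the first machine with room.

    Mutates the machines' free capacity and returns the names of jobs
    that did not fit anywhere.
    """
    unplaced = []
    for name, cpu, mem in sorted(jobs, key=lambda j: (j[1], j[2]), reverse=True):
        for m in machines:
            if m.cpu >= cpu and m.mem_gb >= mem:
                m.cpu -= cpu
                m.mem_gb -= mem
                m.jobs.append(name)
                break
        else:
            unplaced.append(name)
    return unplaced


machines = [Machine("a", cpu=8, mem_gb=32), Machine("b", cpu=8, mem_gb=32)]
jobs = [("web", 4, 8), ("worker", 4, 8), ("batch", 6, 24), ("cron", 2, 4)]
unplaced = first_fit(jobs, machines)
for m in machines:
    print(m.name, m.jobs, f"free: {m.cpu} cores, {m.mem_gb} GB")
print("unplaced:", unplaced)
```

Real schedulers weigh many more constraints (affinity, spreading, preemption), but the core problem is the same two-dimensional fit.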

Starting from NixOps, we Nixified all our tooling (e.g. build our Haskell / C++ / Python / TypeScript with Nix), fixed things in nixpkgs in our submodule and made lots of upstream PRs for it (I'm currently at ~300 nixpkgs commits). NixOS works extra well if you upstream stuff your company needs, because it will reduce your maintenance burden and make other industrial users' life easier too. Especially recommended is to upstream NixOS VM tests for services you rely on; for example, I contributed the Consul multi-machine VM test [2], which automatically runs for any version upgrade to Consul in nixpkgs so nobody will break our infra that way.

Hope this helps!

[1]: https://flyingcircus.io/en/about-us/blog-news/details-view/t...

[2]: https://github.com/NixOS/nixpkgs/blob/72936c3bf6272f05922812...

Keep in mind that this is a hobby project that is currently bleeding money. No more money is going to be lost if they lose their DB, don't have backups, go down for a week, etc. So a lot of the things you mention aren't really relevant to this case.

What would you prefer, this website eventually shutting down because the donations barely cover hosting costs and there's nobody to maintain it, or the website occasionally going down but otherwise actually being profitable enough that the founder can continue maintaining it on a part-time basis and keeping the site alive?