by dspillett 658 days ago
If they are self-managing all the extras you get with a decent cloud setup (backups, node failover, load distribution and auto-scaling, multi-region or at least multi-DC for availability beyond single node failures, …), they are going to need an infrastructure person as well as that developer. Preferably two so the one isn't effectively on-call 24/7. And for that multi-DC for availability thing: you might need someone (or assign time from existing people) to manage the accounts with your various providers, as you won't want tens+ of VPSs from just one provider like that. Oh, and on backups & failover, you need person-time (and other resources, but the people are probably the expensive part from the business PoV) to regularly test and adjust all of that, so you can be reasonably sure it all works when actually needed. And you need to manage replacing those people when/if they decide to move on to something new, etc…

Also note that a lot of the things you are paying for (CPU cores, traffic, network throughput) in those nodes are shared resources (that Gbit link especially) and/or have “fair use” policies attached to them, and while the same might be true of cloud providers those policies are often either more generous or (perhaps more important from the business stability PoV) at least better defined.

“Cloud” is still expensive compared to buying and managing individual nodes, even if you add in all the above and the things I no doubt forgot to mention, but it does give a lot more for the money than this sort of comparison suggests. And sometimes just not having to deal with all that, keeping the business more focused on its core competencies, is worth the extra expense.

In DayJob we use Azure a lot, and sometimes I see the costs of certain things¹² and balk, and we do still have infrastructure people to manage the platform, but overall it works better for us than managing our own resources more directly. We have an extra complication due to our client base (regulated companies like banks and insurers, who are storing PII of both their own people and their customers with us) in that we have to give a lot of assurances on security and such which would be more work (it is already a _lot_ of work as anyone else in that sort of B2B arena can attest) if we self-managed everything.

----

[1] $2,400/yr for SFTP access to a storage account if you need it available 24/7?! Especially given we have at least one such account per client as their requirements understandably require that level of separation. I think we'll keep using the relay & management dashboard I set up in a few cheap VMs, thanks…

[2] and the performance given the costs: AzureSQL³ I'm looking at you!

[3] though again, some of that cost is in things like the scaling flexibility and other infrastructure convenience, which the business finds worth paying for

3 comments

> If they are self-managing all the extras you get with a decent cloud setup (node failover, load distribution and auto-scaling, multi-region or at least multi-DC for availability beyond single node failures, …)

History has proven that most of the time these reduce availability rather than increase it. Any sort of failover, and the complicated setup needed to get it going, introduces more bugs and issues than the redundancy makes up for.

Have we forgotten the number of large single-server applications running on single Linux machines that never needed an unplanned restart or had a crash for years? And you can't beat AWS us-east-1 or Azure or GCP in outages lately.

And I doubt any service like this needs auto-scaling. Most services will barely use up a proper single server, i.e. something with >96 cores and >1TB of RAM.

> “Cloud” is still expensive compared to buying and managing individual nodes

> And sometimes just not having to deal with all that, keeping the business more focused on its core competencies, is worth the extra expense.

There are ways to not manage all that and still be in the cloud. It's called don't use AWS or Azure.

> they are going to need an infrastructure person

No.

I run multi-site Ceph+Nomad clusters with NixOS on Hetzner for our startup and maintaining those takes less than 5% of my time.

By using great tools and understanding them well you can do it with little manpower. I learned all those tools in around 3 months total -- so around as much as getting a basic understanding of AWS IAM ;-)

The only thing you don't get with that from your list is auto-scaling. But with Hetzner the price difference vs AWS is 10x for storage, 20x for compute, and 10000x for traffic, so we just over-provision a little. And my 5% time /includes/ manual upscaling.

Yes, I am on-call 24/7 to manage that infra, but I'd be as well when using hosted cloud services. Yes, fixing a Ceph issue, or handling HashiCorp Consul not handling an out-of-disk situation correctly, is more complicated than waiting for S3 to come back from its outage, but the savings are massive. Testing whether your backup restore works is something you need to do equally with hosted services.

So it is definitely possible to self-manage everything, for 5% of one engineer.

> By using great tools and understanding them well you can do it with little manpower.

“and understanding them well” is doing a lot of legwork there. From a standing start how does a startup that has the skills & experience to make the product but not necessarily manage the infrastructure get to the point of understanding the tools well, or even knowing which tools are best to learn to the point of understanding well?

> So it is definitely possible to self-manage everything, for 5% of one engineer.

I can accept that as true, if you have the right person/people, and they are willing (particularly the on-call part).

I'm in a similar situation; what resources did you find helpful for learning NixOS? Tho I could skip that for now and stick containerized, in which case I just need Nomad..but I'm not certain on picking it over K8s in any case. Just knowing I'm gonna have to deal with this soon and you seem to have it figured out enough!

I found NixOps when searching for an alternative to Ansible that is actually declarative and not just a "bash in yaml" runner. Our Ansible deployments took > 10 minutes and were not "congruent" (well explained in [1]): Removing the Ansible line that installed nginx did not uninstall nginx, so the state on all servers diverged over time and we had no clue what was running where. Docker was also very slow because changing something early in a Dockerfile leads to lots of re-building, because again it's just bash scripts with snapshotting.
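
To illustrate the congruence point, here is a toy Python sketch. It is only a model: a "server" is reduced to a set of installed package names, and the function names are made up for illustration. It does not reflect how Ansible or Nix work internally.

```python
# Toy model (assumption): a server is just the set of installed package names.


def apply_imperative(server: set[str], playbook: list[str]) -> set[str]:
    """Run each 'install X' step in turn; nothing is ever removed."""
    return server | set(playbook)


def apply_declarative(server: set[str], desired: set[str]) -> set[str]:
    """Make the server match the declaration exactly."""
    to_remove = server - desired
    to_install = desired - server
    return (server - to_remove) | to_install


old_config = ["nginx", "postgresql"]
new_config = ["postgresql"]  # the line that installed nginx was deleted

# Deploy the old config first, then the new one, with each approach.
imperative = apply_imperative(apply_imperative(set(), old_config), new_config)
declarative = apply_declarative(
    apply_declarative(set(), set(old_config)), set(new_config)
)

print(sorted(imperative))   # nginx is still installed
print(sorted(declarative))  # only what is declared remains
```

With the imperative model, the server's state depends on its deployment history; with the declarative one, it depends only on the current config.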

I thought "surely somebody must have invented a better system for this" and NixOps was exactly that. Deploying config changes always took a few seconds with that, instead of 10 minutes.

> what resources did you find helpful for learning NixOS?

This was already in 2017 so documentation was worse than it is today.

On a flight I read the Nix, NixOS, Nixpkgs manuals top to bottom. I also read some of the nix-pills, but didn't like that they went so deep into the weeds of packaging when my primary interest at the time was OS configuration management. In retrospect, I should have read those front to back as well, to save some time later when packaging our own software and some specific dependencies became more important for us. I also read various blog posts, examples, and asked some questions in the IRC channel (now Matrix), where there were some people who simply knew every detail and were willing to spend hours sharing their knowledge (thanks cleverca22!).

I also read key NixOS logic source code, such as the `switch-to-configuration` script that switches between 2 declarative configs (like many, I do not like that this is written in Perl, and I'm sure it will eventually be switched).

A thing I did wrong was to learn too late how to write my own NixOS modules; I wrote our own systems as "plain nix functions" but they would have been better as NixOS modules, because those allow overriding parts of the config from outside, and make code more composable (see also https://news.ycombinator.com/item?id=41355203).

I spent 2 months prototyping all our infra in NixOps and learned by doing.

I also learned specifically where the gaps are: NixOS generally handles what's running on a single machine (with systemd units), and with e.g. NixOps you can access the global config of other machines (to render e.g. a Wireguard config file where you need to put in all machines to connect to, so {all machines' IPs} \ {own IP}). It does not handle active cross-machine coordination, e.g. if some GlusterFS or Ceph tutorial says "first run this command on this machine, then afterwards that command on that other machine", or "run this command on any machine, but only run it once". So I learned Consul as a distributed lock service to coordinate (mutex) commands across machines. Luckily, the amount of software that needs "installation by human operator running commands" is continuously going down, and declarative config is becoming more of a norm.
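
The "{all machines} \ {own machine}" idea can be sketched in a few lines of Python. This is an illustration only: the inventory, key placeholders, and function names are hypothetical, and in NixOps the equivalent would be written in Nix against the real machine definitions.

```python
# Hypothetical inventory: machine name -> (WireGuard public key, tunnel IP).
MACHINES = {
    "node1": ("PUBKEY_NODE1", "10.0.0.1"),
    "node2": ("PUBKEY_NODE2", "10.0.0.2"),
    "node3": ("PUBKEY_NODE3", "10.0.0.3"),
}


def peers_for(own_name: str) -> list[str]:
    """Names of all machines except own_name (set difference), sorted."""
    return sorted(set(MACHINES) - {own_name})


def render_peer_sections(own_name: str) -> str:
    """Render one [Peer] section per other machine."""
    sections = []
    for name in peers_for(own_name):
        pubkey, ip = MACHINES[name]
        sections.append(
            f"[Peer]\n# {name}\nPublicKey = {pubkey}\nAllowedIPs = {ip}/32\n"
        )
    return "\n".join(sections)


print(render_peer_sections("node2"))
```

Each machine gets a config derived from the same global inventory, so adding a machine in one place updates every peer list on the next deploy.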

With NixOS, a good thing is that while it is reasonably complex, it is simple enough that you can understand it fully, that is, for any given behaviour you _know_ where in the nixpkgs code it is. I recommend that approach (spend a few months to understand it fully), because it makes you massively more productive.

I also believe that this is a big benefit of NixOS vs e.g. containers on Kubernetes: Kubernetes is big and complicated, with likely more lines of code than anybody could read, and the mechanisms are more involved (for example, you need to know a lot of iptables to know how a request is eventually routed to your application code). NixOS is simpler (packaging software and rendering systemd units); it uses a more radically different foundation but in turn advanced features on top of it are straightforward (multiple versions of libraries on the same machine, knowing for every binary exactly which source code built it, running _only_ what's declared, automatic transparent build caching, spawning VMs that mimic your physical servers). NixOS provides less than cluster orchestrators like Nomad and Kubernetes (e.g. no multi-machine rolling deploys with automatic rollbacks), but one person can keep it all in their head, and it is very good at building things that run in cluster orchestrators. (Disclosure: I know much more about NixOS than Kubernetes; maybe Kubernetes experts disagree with me and think that a single person can understand Kubernetes source entirely to get the fast directed debugging I claim is possible with NixOS.)

Often, you also don't need a cluster orchestrator. Our Ceph runs straight on NixOS on Hetzner dedicated machines, it does not run in our Nomad. We use Nomad to schedule our application-specific jobs onto our machines -- that is, we use the cluster orchestrator for its original design goal (bin-packing CPU + memory jobs across machines), and do not use the cluster orchestrator as a "code packaging and deployment tool", which is what much of current Docker+Kubernetes is used for. We find that Nix is simpler and better for the latter.
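
For readers unfamiliar with the term, bin-packing CPU + memory jobs across machines can be sketched as below. This is a simple first-fit-decreasing heuristic with made-up machine and job names, shown only to illustrate the idea; it is not Nomad's actual scheduler.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    name: str
    cpu_free: float  # cores still available
    mem_free: float  # GiB still available
    jobs: list[str] = field(default_factory=list)


@dataclass(frozen=True)
class Job:
    name: str
    cpu: float  # cores requested
    mem: float  # GiB requested


def schedule(jobs: list[Job], machines: list[Machine]) -> list[str]:
    """First-fit decreasing: place the biggest jobs first, each onto the
    first machine with enough free CPU and memory.

    Mutates the machines and returns the names of jobs that did not fit.
    """
    unplaced = []
    for job in sorted(jobs, key=lambda j: (j.cpu, j.mem), reverse=True):
        for m in machines:
            if m.cpu_free >= job.cpu and m.mem_free >= job.mem:
                m.cpu_free -= job.cpu
                m.mem_free -= job.mem
                m.jobs.append(job.name)
                break
        else:
            # No machine had room for this job.
            unplaced.append(job.name)
    return unplaced


# Hypothetical cluster and workload.
machines = [
    Machine("a", cpu_free=8, mem_free=32),
    Machine("b", cpu_free=8, mem_free=32),
]
jobs = [
    Job("render", cpu=6, mem=8),
    Job("index", cpu=4, mem=24),
    Job("api", cpu=2, mem=4),
    Job("cron", cpu=1, mem=1),
    Job("huge", cpu=16, mem=64),
]
unplaced = schedule(jobs, machines)

for m in machines:
    print(m.name, m.jobs, m.cpu_free, m.mem_free)
print("unplaced:", unplaced)
```

The point of the sketch is that the scheduler only decides where jobs run given resource requests; how the code gets built and shipped to the machine is a separate problem.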

Starting from NixOps, we Nixified all of our tooling (e.g. build our Haskell / C++ / Python / TypeScript with Nix), fixed things in nixpkgs in our submodule and made lots of upstream PRs for it (I'm currently at ~300 nixpkgs commits). NixOS works extra well if you upstream stuff your company needs, because it will reduce your maintenance burden and make other industrial users' lives easier too. Especially recommended is to upstream NixOS VM tests for services you rely on; for example, I contributed the Consul multi-machine VM test [2], which automatically runs for any version upgrade to Consul in nixpkgs so nobody will break our infra that way.

Hope this helps!

[1]: https://flyingcircus.io/en/about-us/blog-news/details-view/t...

[2]: https://github.com/NixOS/nixpkgs/blob/72936c3bf6272f05922812...

Keep in mind that this is a hobby project that is currently bleeding money. No more money is going to be lost if they lose their DB, don't have backups, go down for a week, etc. So a lot of the things you mention aren't really relevant to this case.

What would you prefer, this website eventually shutting down because the donations barely cover hosting costs and there's nobody to maintain it, or the website occasionally going down but otherwise actually being profitable enough that the founder can continue maintaining it on a part-time basis and keeping the site alive?