Hacker News new | ask | show | jobs
by throwaway823882 1881 days ago
There will not be a massive productivity boost in the next 10 years. We haven't had a productivity boost in the past 20 years. In fact, we're much less productive now. But I digress.

We're still largely making and running software the same way we did 20 years ago. The only major differences are the principles that underpin the most modern best practices.

"What the fuck does that mean," you say? It means that I can take a shitty tool and a shitty piece of infrastructure, and make it objectively more reliable, faster, and easier, by using modern practices. Conversely, I can take the latest tools, infrastructure, etc, and make them run like absolute dogshit, by using the practices of a decade ago.

Let's take K8s for example. It's like somebody decided to take the worst trends in software development and design-by-committee and churned out something that is simultaneously complicated and hard to use [correctly]. Setting it up correctly, and using it correctly (no, Mr. I'm A Javascript Developer, however you think you're supposed to use it, you are wrong), for a medium-scale business, requires banging your head against a hodge-podge of poorly designed interfaces and bolting on add-on software for at least a month.

This is ridiculous, because what you actually needed was something whose principles match those of K8s. You want immutable infrastructure as code. You want scalable architecture. You want (in theory) loosely joined components. Do you need K8s to do that? No! K8s is literally the most complicated way that exists to do that.

Next let's take Terraform. A clunky DSL that is slowly becoming sentient. Core functionality that defeats itself (running terraform apply is unpredictable, due to the nature of the design of the program, even though it was created with the opposite intention). Kludgy interfaces that are a chore to cobble together. Dependency hell. Actual runtime in the 10s of minutes (for the infra of a single product/region/AWS account). And no meaningful idea if the state of the infrastructure matches its state file matches the code that generated it all. To say nothing of continuous integration.

Do you need all of that to do infrastructure as code? Of course not. A shell script and Make could do the same thing.

So, you have a kludgy Configuration Management tool (terraform), and a kludgy Cluster Management tool (k8s), and all just to run your fucking Node.js webserver, make sure it restarts, and make sure a load-balancer is updated when you rotate in new webservers. Did you need all of this, really? No. The end result is functionally equivalent to a couple of shell scripts on some random VMs.

To get the best out of "Software Infrastructure" you don't need to use anything fancy. You just need to use something right.

---

But there's a different problem with infra, and it's big. See, Cloud Infrastructure works like a server. Like a physical server, with a regular old OS. To manage it, first you "spin up" the physical server. Then you "configure" the server with some "configuration management" software. Then you "copy some files to it". Then you "run something on it". And while it's running, you change it, live. This is how all Cloud Infra is managed. This is known as Mutable Infrastructure. Modern best practices have shown that this is really bad, because it leads to lots of problems and accidental-yet-necessary complexity.

So what is the solution? The solution is for cloud vendors to supply interfaces to manage their gear immutably. And that is a really big fucking problem.

Say I have an S3 bucket. Let's say I have a bucket policy, lifecycle policy, ACLs, replication, object-level settings, versioning, etc. The whole 9 yards. Now let's say I make one change to the bucket (a policy change, say). Production breaks. Can I "back out" that change in one swift command, without even thinking about it, and know that it will work? No.

Assuming an AWS API call even has versioning functionality (most don't) I can use some API calls to query the previous version, apply the previous version as the current version, change some other things, and hope that fixes it. But there's no semblance of `rm S3-BUCKET ; cp -f S3-BUCKET-OLD S3-BUCKET`. You have to just manually fix it "live". Usually with some fucked up tool with a state file and a DSL that will randomly break on you.

There is literally no way to manage cloud infrastructure immutably. Cloud vendors will have to re-design literally all of their interfaces for a completely new paradigm. And that's just interfaces. If trying to do the S3 bucket copy fails for "mysterious reasons" (AWS API calls fail all the time, kids), then your immutable change didn't work, and you still have a broken system. So they actually have to change how they run their infrastructure, too, to use operations that are extremely unlikely to "misbehave".

Do you know how much shit that covers? It took them like 20 years to make what we have now. It may take another 10 to 20 years to change it all.

Therefore, for a very, very long time to come, we will be stuck with shitty tools, trying to shittily manage our infrastructure, the way we used to manage web servers in the 90s. Even though we know for a fact that there is a better way. But we literally don't have the time or resources to do that any time soon, because it's all too big.

3 comments

I wonder what you think is the alternative to Kubernetes that makes it easier to deploy a piece of code to run on a cluster. I hope it's not some kludge of cobbled together shell scripts and ssh.

In general these ideas of magical 'immutable' infrastructures seem to pan out much worse than just plain VMs with some orchestration solution inside. Anything that tries to capture the state of the world and reify it usually has to severly limit what you can actually do to avoid the halting problem, or just copy some things and hope for the best in terms of external dependencies being the same.

A Kubernetes cluster already runs on cobbled together shell scripts. The K8s build and setup files, Dockerfiles, in-lined commands in kubectl files. Shell scripts are just small simple programs.

Immutable Infrastructure isn't magical at all, it's a paradigm shift which has real world effects. And you can do it with plain VMs and some software inside. But the VM has to be immutable, and the software inside should be as immutable as possible. Otherwise you are constantly fighting a battle to "fix" the state over and over again, and you can't clearly reason about the operation of the system.

Immutable Infrastructure doesn't solve the halting problem, it avoids the halting problem, because it depends on not having general algorithms and unlimited input. It instead uses specific operations with specific criteria in a predictable way, the way safety systems are designed.

Keep your s3 config in version control, terraform apply, if it doesn't work revert the commit and terraform apply again. The DSL isn't that hard: https://registry.terraform.io/providers/hashicorp/aws/latest...
It would be nice if it worked that way.

What happens if your network is interrupted during apply or your STS credential expires? If you are lucky, you may be able to upload the errored.tfstate and plan/apply again. But if somebody else plan/applies before then, the state will be out of whack. Resources will already exist with names that you want to use, and you will have to manually edit the state file, or import the existing resources, or delete them.

What happens if someone changes the bucket without using Terraform? The bucket's changes are blown away on the next terraform apply. Or the apply will sometimes just fail, and you'll have to fix the state or resource by hand.

What happens if someone has applied a different change after your change? Reverting the commit may now impact something else somebody since removed/added, running into naming conflicts and needing custom code changes, and sometimes moving around of state (happens often when a commit changes whole modules).

Or say a code change changes the provider version. A revert may not work and you may need to remediate the infra and code yet again.

Or say someone applied Terraform with a newer version of Terraform. Now you need to upgrade your version of Terraform, and change whatever code you wanted to revert to the new Terraform version.

Or say the resource + action require delete before create, and your reverted commit tries to do create before delete. You'll need to modify your Terraform code after reverting it with a new lifecycle block, or apply will keep trying to do something impossible.

All of this is happening in production. How much downtime is ok while you fix all of those problems by hand?

> running terraform apply is unpredictable, due to the nature of the design of the program, even though it was created with the opposite intention

Citation needed. Run _with a plan_, terraform apply is entirely predictable as to the operations it will carry out and the order in which it will carry them out.

You are incorrect. edit according to your wording, you are sometimes correct. But terraform plan often can't show values that it can't predict ("Values known after apply"), so you often can't predict what apply will do. And you can never know for sure if the operations will succeed.

First of all, what does terraform apply do? It runs API calls. Those API calls may or may not work, based on criteria outside of the control of Terraform. It may be supposed to work, but that doesn't mean it will. Use AWS enough and you will run into this weekly.

Second, Terraform has semantics to do things like "create before delete" or "delete before create". This behavior is controlled by the provider, per-resource. But Terraform has no earthly fucking clue which it's supposed to do, only the provider knows. And the provider does not check before terraform apply runs whether an operation will conflict with a resource, even if Terraform is managing that resource.

So you run terraform plan -out=terraform.plan, and the whole thing output seems fine! So you terraform apply terraform.plan, and Terraform happily performs an operation which is literally impossible, and then fails. Then you go find the code that has been working just fine for 2 years and change it to fix the lifecycle for this particular scenario, and then you can terraform apply correctly.

Thirdly, a lot of terraform plan runs show output with unknown values, because the values won't be known until after you run terraform apply. But those values can conflict with existing resources with the same values. Terraform makes no attempt to verify if conflicting resources exist before apply. Therefore, your terraform apply can fail unpredictably (to be correct: you could predict it if Terraform were designed to actually look for conflicting resources before applying).

There's about a hundred of these edge cases that I run into every day. I already perform all operations in a non-production environment, but as we all know, the environments naturally drift (because cloud infrastructure is mutable infrastructure) so there are times production breaks in a way that didn't in non-production.

> edit actually, according to your wording, you are correct.

That’s uhh... why I chose that wording.

> Use AWS enough and you will run into this weekly.

I _really_ wish you understood how amusing this exhortation is.

I’m under no illusion about the reliability of AWS APIs - much less Azure or any of the other major public clouds.

The easiest way to reduce drift is to prevent read/write access via anything except a designated service account for Terraform. Per one of your earlier replies about a coworker having upgraded TF underneath you, it sounds like you aren’t doing that.

I’m with you on some aspects of your rant - specifically precious resource consumption such as names - for what it’s worth (and am a former core maintainer of Terraform, and still work with that codebase daily elsewhere), but if you’re going to rant about a project whose maintainers (or former maintainers) may be listening, you’d better be technically correct.

I’d suggest issues (yes I know there are a lot of them, I don’t know how they get triaged anymore) or mailing list posts, or <gasp> contributions as a more helpful way to engage rather than complaining on an internal Slack channel.

Been there. If Mitchell or anyone else on the core team doesn't want a feature to work a certain way, or feels that a bug isn't a bug, or maybe they want to wait until the next big release to look at something, tough luck for us. I suppose I have to become a Terraform developer in my spare time, and then I only need to spend years fighting the corporation that owns it over how to design it to actually function the way humans need it to.

Terraform doesn't look for existing resources. It doesn't properly plan for new values. It doesn't have a convention to define resources that can or can't be used idempotently. It doesn't support modern deployment methods. It doesn't make a best effort to fix things even when it clearly can detect what the solution or user's intent is. It can't auto import resources, even when it knows that something already exists, and it won't remove something in its own state when you're trying to import it, even though you clearly want to replace what's in state with what you're importing. And it's a monolith, so it's not like we can change small parts to fix our use cases.

There's no amount of GitHub issues or mailing lists that will fix all this crap. Some of it is poor functionality, some of it is buggy functionality, and some of it is just bad design. It needs to be redesigned, and its components separated so they can be replaced or improved as needed, without editing Go code or a DSL.

But mainly, it needs to work the way humans need it to work in production. Not the way a bunch of idealistic devs think is theoretically the best way. "Just don't ever touch an AWS account at all by any means other than Terraform. What's so hard about that?" Nobody who has an actual live product used by customers [for money] could possibly survive this way without hours of outages and weeks of delays. If your tool is designed for an impossible scenario, then it's impossible to use it "properly".

> Terraform doesn't look for existing resources.

This is 100% at odds with your stated goal of predictability.

> It doesn't make a best effort to fix things even when it clearly can detect what the solution or user's intent is.

As is this.

> It can't auto import resources, even when it knows that something already exists, and it won't remove something in its own state when you're trying to import it, even though you clearly want to replace what's in state with what you're importing.

I’m noticing a pattern with your stated desires being at odds with your stated goals.

FWIW, I manage dozens of accounts which no one touches outside of TF (or other similar tools) and have worked this way since 2016 with basically no issues.