| HN Mirror

> Dat soundz like a bank and not a cloud provider.

The first stage in making something reliable, sustainable, and as easy to run as possible is to understand the problem, and understand what you're trying to achieve. You shouldn't be writing any code until you've got that figured out, other than possibly to make sure you understand something you're going to propose.

It's good software engineering, following practices learned, overhauled, and refined over decades, that have a solid track record of success. It's especially vital where you're working on something like AWS, Azure, etc. cloud services.

If you leap feet first in to solving a problem, you'll just end up with something that is unnecessarily painful down the road, even in the near term. It's often quicker to do the proposal, get it reviewed, and then produce the product than it is to dive in and discover all the gotchas as you go along. The process doesn't take too long, either.

Every service in AWS will follow similar practices, and engineers do it often enough that whipping up a proposal becomes second nature and takes very little time. Just writing the proposal in and of itself is valuable because it forces you to think through your plan carefully, and it's rare for engineers not to discover something that needs clarified when they write their plan down. (side-note: All of this paperwork is also invaluable evidence for any promotion that they may be wanting, arguably as much as actually releasing the thing to production). It shouldn't take a day to write a proposal, and you'd only need a couple of meetings a few days apart to do the initial review and final review. Depending on the scope of what came up in the initial review, the final review may be a quick rubber stamp exercise or not even necessary at all.

Where I am now, we've got an additional cross-company group of experienced engineers that can also be called on to review these proposals. They're almost always interesting sessions because it brings in engineers who will have a fresh perspective, rather than ones with preconceived notion based on how things currently are.

An anecdote I've shared in greater detail here before: Years ago we had a service component that needed created from scratch and had to be done right. There was no margin for error. If we'd made a mistake, it would have been disastrous to the service. Given what it was, two engineers learned TLA+, wrote a formal proof, found bugs, and iterated until they got it fixed. Producing the java code from that TLA+ model proved to be fairly trivial because it almost became a fill-in-the-blanks. Once it got to production, it just worked. It cut down what was expected to be a 6 months creation and careful rollout process down to just 4 months, even including time to run things in shadow mode worldwide for a while with very careful monitoring. That component never went wrong, and the operational work for it was just occasional tuning of parameters that had already been identified as needing to be tune-able during design review.

In an ideal world, we'd be able to do something like how Lockheed Martin Corps did for the space shuttles: https://www.fastcompany.com/28121/they-write-right-stuff, but good enough is good enough, and there's ultimately diminishing returns on effort vs value gained.