Hacker News
by mfer 2128 days ago
One consequence is the accidental deletion of AWS things...

If a CRD is deleted, the CRs it describes are also deleted. So, deleting a CRD (even accidentally) could end up deleting resources in AWS (e.g., backups). So, be careful.

Some things being managed by Kubernetes would be really cool. Other things being managed by k8s could break things if something goes wrong. I would plan accordingly.

5 comments

Hi, I'm a developer advocate in the AWS Container organization. This question of how to handle resource destruction is an active issue on the project, and we'd welcome your comments (or those of anyone else) on this GitHub issue: https://github.com/aws/aws-controllers-k8s/issues/82

Our goal is to make this project have "no surprises" and therefore no unexpected destruction of resources. The specifics of how we mark resources as safe to delete instead of retaining by default are under discussion on that GitHub issue.

One way our team has dealt with this question in a highly serverless-oriented and infrastructure-as-code-driven (how's THAT for buzzword soup) environment is to explicitly separate stateful resources from stateless, while exposing reference hooks in a configuration store to cross from one to the other. We've found that doing so _greatly_ reduces the blast radius of mistakes and lets us move more quickly and confidently.

The stateless stacks generally have a lot of development activity going on, and rapidly iterate. This is where most of our code and logic lives. This is where the vast majority of our deployment (and related cloud configuration) activity happens.

All of that thrash is kept away from the stateful stacks - think S3 buckets or DynamoDB tables - where, if THOSE thrash, we potentially get an outage at best, or lose data at worst (backups notwithstanding).

We DO NOT WANT stateless-oriented stacks to own the lifecycle for stateful stacks. They inherently need to be treated differently. Or, at least, the impact of mistakes is different.

The trick comes when you need to tie them together. To do this, we've added CloudFormation hooks and other deployment time logic that publish ARN and other connectivity info to our configuration store. The stateless services look up config values either during deployment or at runtime and are able to find the details they need to reference the state resources they need access to.
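
As a rough illustration of that split, here is a minimal sketch in which the stateful stack publishes a reference and the stateless stack only reads it. The `ConfigStore` class, the key name, and the ARN are hypothetical stand-ins, not the commenter's actual tooling; a real setup would back this with an actual configuration service.

```python
from dataclasses import dataclass, field


@dataclass
class ConfigStore:
    """In-memory stand-in for a real configuration store (hypothetical)."""

    _values: dict = field(default_factory=dict)

    def publish(self, key: str, value: str) -> None:
        self._values[key] = value

    def lookup(self, key: str) -> str:
        try:
            return self._values[key]
        except KeyError:
            raise LookupError(f"no config value published for {key!r}") from None


def deploy_stateful_stack(store: ConfigStore) -> None:
    # The stateful stack owns the table's lifecycle and publishes only its ARN.
    # The ARN below is a made-up placeholder.
    store.publish(
        "orders/table_arn",
        "arn:aws:dynamodb:us-east-1:123456789012:table/orders",
    )


def deploy_stateless_service(store: ConfigStore) -> dict:
    # The stateless stack never creates or deletes the table;
    # it only resolves the published reference.
    return {"ORDERS_TABLE_ARN": store.lookup("orders/table_arn")}


store = ConfigStore()
deploy_stateful_stack(store)
env = deploy_stateless_service(store)
print(env["ORDERS_TABLE_ARN"])
```

Because the stateless side holds only a lookup key, tearing down and redeploying it cannot touch the table itself.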

We've poked at toolsets like Amplify that lump everything together and have already been bitten numerous times. We've found that the difference between stateful and stateless resources should not be papered over, but instead emphasized and supported explicitly by tooling.

... all of this being one team's experience over the years, of course.

Very curious to see how this paradigm evolves here!

[edit]… Riffing on this just a little bit further… as I’m thinking about it here, it comes down to abstraction level. In a deployment or resource management domain, a generic “this is a cloud resource” isn’t very useful. What’s way _more_ useful is something like “this is a stateful resource” or “this is a stateless resource”, because that level describes resource behavior more clearly, AND how to interface with or manage those resources.

There are echoes of code development principles here, intentionally - robust cloud infrastructure management mirrors software dev practices as much as infrastructure management ones!

Went a very similar route with Serverless and Terraform. Things we couldn't afford to lose went into Terraform (RDS instances, Kinesis, etc). AWS resources that were more coupled to the individual services and could be deleted and recreated went into the serverless configs.

A large reason for this was that we couldn't trust CloudFormation or Lambda not to require blasting away and recreating resources. At least at the time, it was not uncommon for a stack to get "stuck" or for a Lambda function to stop having its ENIs properly configured (essentially stuck).

I sincerely hope this works well. CloudFormation already has many issues with unintended changes, visibility of the changes to be done, auto update/deletion of resources, etc. Having another path to provision infrastructure that is safe, consistent, and provides visibility at all phases would be great.

Also, not all AWS resources follow the same deletion semantics. Example: S3 buckets. They report as being deleted somewhat quickly, but their name may not be available again for an hour or so.

In this case the delete will appear to succeed, but the recreation, if done with the same name, may fail.

Great example. I worry about adding a layer of abstraction over provisioning resources this way.

Of course we have to try this because it's a badass (tho obvious in hindsight) idea, but in practice it might have some downsides.

Does the create during this time window return a specific enough error? This seems like the exact case where a controller that never gives up could provide value. Though I'm kind of amazed this is on the order of an hour instead of minutes.
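
A minimal sketch of the "controller that keeps trying" idea: retry the create with exponential backoff while the name is still reserved. `NameNotYetAvailable` and `fake_create` are hypothetical stand-ins; which error a real create call returns in this window is exactly the open question above. Unlike a real controller, this sketch gives up after `max_attempts`.

```python
import time


class NameNotYetAvailable(Exception):
    """Hypothetical error: the name is still reserved after a delete."""


def reconcile(create, max_attempts=8, base_delay=1.0, max_delay=300.0, sleep=time.sleep):
    """Call create() until it succeeds, backing off exponentially between tries."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return create()
        except NameNotYetAvailable:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            sleep(delay)
            delay = min(delay * 2, max_delay)  # cap the wait between tries


# Simulated create that fails three times before the name frees up.
attempts = {"count": 0}


def fake_create():
    attempts["count"] += 1
    if attempts["count"] <= 3:
        raise NameNotYetAvailable("bucket name still reserved")
    return "created"


waits = []  # record the delays instead of actually sleeping
result = reconcile(fake_create, sleep=waits.append)
print(result, waits)  # created [1.0, 2.0, 4.0]
```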

This issue already exists when using, say, Terraform to orchestrate infra with code. The solution is to append a random hex string to the resource's unique identifier.

This AWS project will need to support a feature like that.
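
For illustration, the random-suffix approach might look like the sketch below. The `unique_name` helper and the `backups` prefix are hypothetical, not part of any tool mentioned here.

```python
import secrets


def unique_name(prefix: str, nbytes: int = 4) -> str:
    """Return prefix plus a random hex suffix, e.g. 'backups-' and 8 hex chars."""
    # token_hex(nbytes) yields 2 * nbytes hexadecimal characters.
    return f"{prefix}-{secrets.token_hex(nbytes)}"


name = unique_name("backups")
print(name)
```

Since each create gets a fresh name, a recreate never collides with a name that is still reserved from a recent delete. The trade-off is that the name is no longer predictable, so consumers have to look it up rather than hard-code it.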

Exactly. Planning accordingly is the right answer.

Assume anyone can destroy your infrastructure at any time (by mistake or otherwise). This could be done with CloudFormation, Terraform, API calls, and essentially any automation (with different levels of safeguards).

Be prepared for that. Be careful with your data. Not so careful with individual servers - they should be cattle, not pets.

EDIT: If this is a production system, one could take away any 'delete' permissions until they are needed again.

"cattle, not pets" is a nice slogan, but I think most ranchers would be angry if a junior ranch hand accidentally poisoned the entire herd with a typo.

I think you have mostly the same problems in Terraform, Pulumi, or CloudFormation though, right? Is there anything that makes it easier to accidentally do in k8s?

One layer of defense in all of these cases is keeping the IAM credentials that the configuration management tool uses from having any deletion permissions.
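
As a sketch of what that could look like, the snippet below builds an IAM policy document with an explicit Deny on a few destructive actions. The action list and the `Sid` are illustrative assumptions; a real policy would need to cover whichever services the tool manages, and would likely scope `Resource` more narrowly.

```python
import json


def deny_delete_policy(actions):
    """Build an IAM policy document that explicitly denies the given actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyDestructiveActions",  # illustrative statement ID
                "Effect": "Deny",
                "Action": sorted(actions),
                "Resource": "*",
            }
        ],
    }


policy = deny_delete_policy(
    ["s3:DeleteBucket", "rds:DeleteDBInstance", "dynamodb:DeleteTable"]
)
print(json.dumps(policy, indent=2))
```

In IAM policy evaluation an explicit Deny overrides any Allow, so attaching a statement like this guards against delete permissions that were granted elsewhere.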

That’s an excellent idea. I never thought about that. In hindsight, it makes perfect sense.

Also, set the DeletionPolicy in CF to Retain.
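
For reference, CloudFormation's `DeletionPolicy` attribute takes values such as `Delete`, `Retain`, and `Snapshot` (for resource types that support snapshots). A minimal sketch of a JSON template that keeps a bucket when its stack is deleted, generated here from Python; the `BackupBucket` logical ID is a hypothetical name:

```python
import json

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "BackupBucket": {  # hypothetical logical ID
            "Type": "AWS::S3::Bucket",
            # Keep the bucket if the stack (or this resource) is deleted.
            "DeletionPolicy": "Retain",
            # Keep the old bucket if an update forces a replacement.
            "UpdateReplacePolicy": "Retain",
        }
    },
}

print(json.dumps(template, indent=2))
```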

There are certain things that freak me out in IaC, like losing a primary DB or secrets. I want IaC to create them, wire things together and index them, but the idea of an accidental misconfiguration deleting a production database gives me heartburn.

That's why you should have a QA pipeline ready to check the config before you're able to apply it, right? Secrets should be encrypted in git, as far as I know.