by alwaysanon 1315 days ago
The author was right to explain that it only makes sense when you know the history of how we got to where we are - but there is a bit more to it. It is all a series of band-aids. Band-aids all the way down. As somebody who has been involved with AWS since 2012, I've seen them all get added incrementally in response to explosions in usage and complexity and to customer unhappiness and frustration.

Explicit allows being all you can do in an IAM policy were easy(ish) when there was a handful of AWS services and API actions. As there were more and more services and policy actions etc. they became unwieldy. Enter Permission Boundaries where you could wrap a few explicit denies around them. Kubernetes RBAC is nearly at the same place and could now really use those too - but I digress.

Also, early on in my AWS journey, even two accounts (one non-prod and one prod) was only done half the time and viewed as a best practice to think about - people generally just opened one AWS account. But when IAM wasn't enough (i.e. there wasn't enough granularity on the resources or the conditions exposed etc.) the answer became separate AWS accounts as the only, or at least the easiest, way to enforce these authorization boundaries/separations with a blunt instrument you could trust. It also helped to keep your bills straight before they would do things like break them down by Tag.

Then you often needed cross-account role assumptions to deal with the inevitable cases where things or people needed access between these accounts.

Then the explosion in AWS Accounts led to AWS Organizations to provision and manage them all. And it built in Service Control Policies and OUs as a tool/layer to help further manage/constrain IAM policies/permissions (IAM policies, Permission Boundaries and SCPs now form a Venn diagram with each other).
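
To make the Venn diagram concrete, here is a rough sketch of how the three layers might combine for a single action. It is a deliberate simplification: it ignores resources, conditions, resource-based policies and session policies, and the function names are made up for illustration rather than taken from any AWS SDK.

```python
from fnmatch import fnmatchcase


def matches(patterns, action):
    """True if the action matches any IAM-style wildcard pattern (case-insensitive)."""
    return any(fnmatchcase(action.lower(), p.lower()) for p in patterns)


def is_allowed(action, identity_allow, boundary_allow, scp_allow, explicit_deny=()):
    """An action must sit inside every layer and outside every explicit deny."""
    if matches(explicit_deny, action):
        return False
    layers = (identity_allow, boundary_allow, scp_allow)
    return all(matches(layer, action) for layer in layers)
```

With this model, a wildcard-y identity policy such as `["*"]` is still cut down to whatever the boundary and the SCP both permit, which is the overlap in the middle of the diagram.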

But AWS Organizations managing heaps of accounts was also too painful to use and get right and so they brought in AWS Control Tower to help make setting up Organizations easier.

So, in short, this all makes sense when you understand the inability to totally rewrite/refactor important complex systems used by customers (breaking backward compatibility) and instead trying to keep solving all the challenges with a steady stream of incremental band-aids that you can announce at re:Invent.


What in your opinion is a good/best implementation then? We're currently engaged in re-designing our authorization flow, and I was planning to use a model quite similar to AWS IAM (policies, etc.). Contrary to this article, I actually thought the IAM model was simple in that we can tie each entity's access down to a set of policies that we can independently develop, store, and process for providing access.
If you are developing a simple service that only needs access to an S3 bucket and a DynamoDB table etc. at runtime then it is pretty straightforward to write the explicit allows to the right resources afforded in the 'base' AWS IAM.

Where things usually get tricky is for the CI/CD pipelines and/or the administrative users that need way more access. It is very hard to scope to true least privilege - including for things like lots of ECS/Fargate where you don't want people to mess with each other's Task Definitions/Tasks/Services hosted within the same AWS account for one example. The various AWS services are very hit-and-miss for how well you can scope resources and what conditions they offer you.

Security will say "no resource star" which is a best practice but quite difficult to get right in most larger accounts. Permission Boundaries help in being able to flip the conversation to "let's list all the things these users/pipelines shouldn't be able to do instead of what they should to then constrain the more wildcard-y/star-y IAM policy we need." But even those are getting harder these days because there is so much you don't want your average operator/admin to be able to do too.

Usually people throw up their hands and over-provision generally or give each team their own AWS account and overly permissive access within it - but ring-fenced to their own stuff at least as a risk trade-off. Though I think the pendulum may be swinging back from a bazillion AWS accounts (with all those problems) back to trying to solve the IAM problem with additional new tools (CIEMs etc) that will help them to manage IAM as-a-service with a pretty UI or by letting you scope down users/roles to only the activity they have done within the last 7 days etc.

There is a great line that "complexity isn't created or destroyed - it is just made somebody else's problem" - do you want to make these an AWS IAM problem or an AWS account-management problem? A pipeline/automation problem or a heavily-staffed security team who can write great IAM policies/PBs/SCPs problem? A SaaS vendor we can procure a CIEM from problem? etc.

IAM, like all things AWS, has its own API and rich CLI. So, we pulled this problem outside of IAM entirely.

We created a policy system that allows us to define these individual minimized policies based upon specific services that we've created. We have a tool that can then combine these small bite sized policies into a larger policy while combining compatible actions and resources giving you a resultant policy that is equivalent but often much smaller than the logical combination of all the individual policies.
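
A minimal sketch of what such a combining step could look like, not the actual tool described above. It assumes every statement uses plain `Effect`/`Action`/`Resource` (no `NotAction`, `NotResource` or `Principal`) and only merges statements whose effect, resource set and condition are identical, which keeps the result equivalent to the inputs. The function names are hypothetical.

```python
import json
from collections import defaultdict


def as_list(value):
    """IAM allows a single string or a list; normalize to a list."""
    return [value] if isinstance(value, str) else list(value)


def combine(policies):
    """Merge statements that share Effect, Resource set, and Condition."""
    grouped = defaultdict(set)
    for policy in policies:
        for stmt in policy["Statement"]:
            key = (
                stmt["Effect"],
                tuple(sorted(as_list(stmt["Resource"]))),
                json.dumps(stmt.get("Condition", {}), sort_keys=True),
            )
            grouped[key].update(as_list(stmt["Action"]))

    statements = []
    for (effect, resources, condition), actions in sorted(grouped.items()):
        stmt = {"Effect": effect, "Action": sorted(actions), "Resource": list(resources)}
        if condition != "{}":
            stmt["Condition"] = json.loads(condition)
        statements.append(stmt)
    return {"Version": "2012-10-17", "Statement": statements}
```

Two small policies that grant `s3:GetObject` and `s3:PutObject` on the same bucket collapse into one statement with both actions, which is where the size savings come from.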

You can use the resulting policy in a variety of ways. It's very easy to just make a custom role, set this as an inline policy, and then use some custom tools to keep the policy updated.

In some cases, we went with a "policy.d" directory in a project source tree that contains symlinks to all the small specific policies it's using, and some deployment commands that use these symlinks to create a resultant policy document. If you want to add or remove a policy to a project, it's as easy as adding or removing a symlink. Likewise, it makes it much easier to audit which policies are actually attached to the project.
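
For illustration, the "policy.d" idea might be sketched like this, assuming each linked file is a JSON policy document with a `Statement` list. The directory layout, file extension and function names are assumptions, not the real deployment commands.

```python
import json
from pathlib import Path


def build_from_policy_d(directory):
    """Concatenate the statements of every policy file linked in `directory`."""
    statements = []
    for path in sorted(Path(directory).glob("*.json")):
        # Reading through the symlink loads the shared policy file it points to.
        statements.extend(json.loads(path.read_text())["Statement"])
    return {"Version": "2012-10-17", "Statement": statements}


def attached_policies(directory):
    """Names of the link targets: the audit view of what the project uses."""
    return sorted(p.resolve().name for p in Path(directory).glob("*.json"))
```

Adding or removing a symlink changes the output of both functions on the next run, so the directory listing doubles as the audit trail.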

Thanks for your response, but what I was looking for is your opinion on what is a better solution if IAM has all these issues. We're just starting the implementation, so this is very timely.

In terms of our situation, we provide fine grained access to distributed resources, mainly data elements: think records/fields. An example is to define which user, group, and role can access which records and which fields within each record and to what extent (e.g., can't access SSN at all, can only get last 4 digits of phone number, can see first/last name, etc.).

I really liked the policy approach of IAM so my plan was to let data owners define policies that are then applied to users, groups, and roles. At run time our coordinator engine will check levels of access to each query (that could be to one data store like Postgres or Salesforce or a federated query spanning multiple data stores). By assigning a set of policies (with IAM's effect/action/resource/condition model), we can make this happen in a flexible way as I see it. Effect also has "deny," so that would be very useful for a majority of situations.
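
As a rough sketch of that effect/action/resource/condition model applied to records and fields: the evaluator below assumes explicit deny wins, a single matching allow is enough, and anything unmatched is denied. The statement shape, the `customers/...` resource names and the equality-only conditions are all hypothetical simplifications.

```python
from fnmatch import fnmatchcase


def statement_applies(stmt, action, resource, context):
    """A statement applies when the action, the resource, and every condition match."""
    return (
        any(fnmatchcase(action, a) for a in stmt["actions"])
        and any(fnmatchcase(resource, r) for r in stmt["resources"])
        and all(context.get(k) == v for k, v in stmt.get("condition", {}).items())
    )


def decide(statements, action, resource, context):
    """Explicit deny wins; otherwise one allow is enough; otherwise deny by default."""
    applicable = [s for s in statements if statement_applies(s, action, resource, context)]
    if any(s["effect"] == "deny" for s in applicable):
        return "deny"
    if any(s["effect"] == "allow" for s in applicable):
        return "allow"
    return "deny"


# Hypothetical policies a data owner might write.
POLICIES = [
    {"effect": "allow", "actions": ["read"], "resources": ["customers/*"]},
    {"effect": "deny", "actions": ["read"], "resources": ["customers/ssn"]},
]
```

Here `decide(POLICIES, "read", "customers/ssn", {})` comes back as a deny even though the broad allow also matches, which is the behaviour that makes "deny" useful for carving exceptions out of wide grants.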

A hierarchical model like Google's as mentioned in the article doesn't seem as flexible as this IMO.

I’ve found this blogpost very helpful in thinking about designing a permission system: https://tailscale.com/blog/rbac-like-it-was-meant-to-be/
Ahh sorry. AWS IAM is a good model it just struggles to scale to ~13,000 possible API actions in the platform as it’s grown. The models all have trade-offs - but the fact Kubernetes pretty much used the original AWS IAM model for their RBAC so many years later shows it is a good one…
One potential solution: https://github.com/cerbos/cerbos. It's a standalone service (deployed alongside your app) which evaluates access decisions at runtime against contextual/arbitrary data on the principal and resources.

In your case, your resource could be a "record" for more global yes/no decisions, or perhaps as a "field" for more granular cases. Things like "can only get last 4 digits of phone number" could be achieved through attribute-based conditions set within the policies.
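
Independent of any particular policy engine, the field-level outcome could be sketched in plain Python like this. The rule names ("full", "last4") and the `FIELD_RULES` table are invented for the example; in a real setup they would come from whatever the policy decision returns.

```python
def last_four(value):
    """Mask everything except the final four characters."""
    return "*" * max(len(value) - 4, 0) + value[-4:]


# Hypothetical per-field rules for one role; fields not listed are not returned at all.
FIELD_RULES = {"first_name": "full", "last_name": "full", "phone": "last4"}


def redact(record, rules):
    """Return only the fields the rules permit, masked where required."""
    visible = {}
    for field, value in record.items():
        rule = rules.get(field)
        if rule == "full":
            visible[field] = value
        elif rule == "last4":
            visible[field] = last_four(str(value))
    return visible
```

A record containing an SSN simply comes back without that key, while the phone number keeps only its last four digits.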

> I really liked the policy approach of IAM so my plan was to let data owners define policies that are then applied to users, groups, and roles

An advantage of Cerbos is that policies are defined and deployed separately from business logic in (yaml/json) config files, so no changes are required in code when policies need updating.

> At run time our coordinator engine will check levels of access to each query

Can't wrap my head around this particular part - is this checking if an entity can or cannot run a particular query, or specifically based on the "things" the query is returning?

(as a disclaimer I should mention that I work there, although Cerbos itself is Apache 2 licensed and completely free to use)

It will really depend on your workflows, what services you use, what risks you're trying to manage and what trade-offs you're willing to accept on usability vs security.

For some scenarios, resource-based policies can form the foundation of your auth flow. If you look at the flows described in the docs [1], resource evaluation is simpler. You still need to solve the problem of effectively managing all of those resource policies and limitations on where they can be applied [2]. That might be an easier problem to solve than dealing with trying to express everything as an identity policy. You're then less concerned with wider permissions at the IAM level and move the responsibility to the owner of the resource.

[1] https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_p... [2] https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_a...

In my opinion issues start to arise with dynamic provisioning on the stack level. I do not mean cloudformation stacks specifically, but rather stacks as a "bag of heterogeneous resources deployed together, with inter-dependencies on each other and a singular purpose".

As long as you have a constant number of stacks consisting of ec2 (even with individual resources autoscaling), lambdas, whatever really, you can write an IAM policy for that. It might be tricky but generally doable.

As soon as you get into a random number of stacks you also get into dynamic IAM generation and that is really hard. Add IAM adjustment for user-based inputs and sprinkle with cross-account access and there you have it: an endless stream of new IAM headaches.

Simple vs. complex is often a measure of the types of system the user "wants" vs. the range of cases that the domain might present for solution.

e.g.

* I just want to do something simple => Solution space presented is too complex

* I have a complex use case => Solution space is easy to map my problem into

Yes, it is band-aids all the way down. In my experience few, very few, know all the band-aids and why they were placed there. Turnover within the teams and "good enough" thinking do not help either.
Unfortunately there’s no alternative. Band-aids all the way down is the reality of successful computing platforms.
Until they get replaced one day and then go out of business.

This usually takes 5-10 years longer than customers would like.

Another challenge is to keep in mind that it all has to be granular, performant, and work at scale.

Wait until your situation has you dancing around the 6K character limit for policy documents.

The last I looked, there was a different character limit for inline (2k) vs attached (10k) policies, as well as a character limit for all aggregated policies that applied to a single principal+resource+request.

The API forbids you from exceeding the character limit for individual policies, but the latter limit is only something you can encounter at "run time" or when a request occurs. I asked our account rep at the time what would happen if the sum of all policies was larger than that character limit; they said some arbitrary policy statements would be dropped.
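
If you want to catch this before deployment, a small pre-flight check is one option. This sketch measures each policy as compact JSON, on the understanding that IAM does not count whitespace toward the size, and takes the limit as a parameter because the quotas differ by policy type and are best read from the current IAM quotas documentation. The function names are hypothetical.

```python
import json


def policy_size(document):
    """Length of the policy serialized without insignificant whitespace."""
    return len(json.dumps(document, separators=(",", ":")))


def oversized(policies, limit):
    """Names of policies whose compact size exceeds `limit` (supplied by the caller)."""
    return [name for name, doc in policies.items() if policy_size(doc) > limit]
```

This only covers the per-policy limit; the aggregate limit across everything that applies to a request is harder to check offline, as described above.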

> Explicit allows being all you can do in an IAM policy were easy(ish) when there was a handful of AWS services and API actions. As there were more and more services and policy actions etc. they became unwieldy.

How does adding more AWS services to the platform make following least-privilege unwieldy? Surely your workload does not need permissions to each new service, so new services and new IAM permissions being available is a no-op.

Because now the central ops team has to keep updating the policies to permit access to new services as they're released or when someone complains they can't use one
The challenge is for your IT/Ops folks to develop policies that allow developers to create IAM roles that delegate appropriate permissions to services, without giving those developers permission to accidentally grant too much access to automated processes that could, by malice or misconfiguration, create security vulnerabilities via elevated access.

When new services come along, a new set of rules needs to be designed to even allow devs to try that service out.
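
One common shape for that delegation is to let developers create roles only when a permissions boundary is attached, using the `iam:PermissionsBoundary` condition key. The sketch below builds such a policy as a Python dict; the account ID, role path and boundary policy name are placeholders, and a real setup would also need to stop developers from editing the boundary policy itself or swapping it for another one.

```python
import json

# Hypothetical ARNs; substitute your own account ID, role path, and boundary policy.
BOUNDARY_ARN = "arn:aws:iam::123456789012:policy/developer-boundary"
ROLE_ARN_PATTERN = "arn:aws:iam::123456789012:role/app/*"

delegation_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Role creation and policy changes only succeed when the boundary is attached.
            "Sid": "CreateRolesOnlyWithBoundary",
            "Effect": "Allow",
            "Action": ["iam:CreateRole", "iam:PutRolePolicy", "iam:AttachRolePolicy"],
            "Resource": ROLE_ARN_PATTERN,
            "Condition": {"StringEquals": {"iam:PermissionsBoundary": BOUNDARY_ARN}},
        },
        {
            # Developers can never strip the boundary off a role again.
            "Sid": "NeverRemoveBoundary",
            "Effect": "Deny",
            "Action": ["iam:DeleteRolePermissionsBoundary"],
            "Resource": ROLE_ARN_PATTERN,
        },
    ],
}

policy_json = json.dumps(delegation_policy, indent=2)
```

Every new service then becomes a question of whether to widen the boundary policy, rather than a rewrite of each developer's permissions.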

As someone who builds platforms for a living, I believe the core reason is that AWS is now a spaghetti monolith. That was a somewhat natural consequence of the type of service AWS provides, which is infrastructure. It evolves very slowly because its impact on its tenants is hard to predict. Therefore it’s logical to apply bandaids everywhere.

AWS might be in the trenches of the biggest tech debt in terms of impact known to humankind.

How to solve this? I think lifecycle management. Define lifespan of services so that these can be replaced eventually.

Another way to do it is by coupling price cuts to migrations to the new service.

(This only works if the old service is secure and maintainable enough to run indefinitely.)

And even if you manage IAM right, if your setup is complex enough you might one day find that random AWS components stop provisioning or working because you just blew the stack in the IAM policy engine with a policy that is too big.

Damned if you do, damned if you don't...

involved... you're talking like you worked on the IAM team yourself :P

thanks for the great post

I worked on the cloud/DevOps team of a big customer for more than 5 years then as an AWS SA for almost 5 after that. Now I work for an AWS security partner.

And thanks, glad you enjoyed it :)