Hacker News
by asim 1590 days ago
The AWS horror stories never cease to amaze me. It's like we're banging our heads against the wall expecting a different outcome each time. What's more frustrating, the AWS zealots are quite happy to tell you how you're doing it wrong. It's the user's fault for misusing the service. The reality is, AWS was built for a specific purpose and demographic of user. Now its complexity and scale make it unusable for newer devs. I'd argue, we need a completely new experience for the next generation.
9 comments

I'm not sure any sizable group is banging their head against a wall. Yes, AWS is complex. Yes, AWS has cost foot guns. These are natural outcomes of removing friction from scaling.

Sure we could start with something simpler, but as you may have noticed, even the more basic hosting providers like DigitalOcean and Linode have been adding S3-compatible object storage because of its proven utility.

In terms of making something meaningfully simpler, I think Heroku was the high water mark. But even though it was a great developer experience, the price/performance barriers were a lot more intractable than dealing with AWS.

> These are natural outcomes of removing friction from scaling.

Yes, and making scaling frictionless brings a very tiny bit of value for everybody, but a huge amount of value for the cloud operator. Any bit of friction would completely remove that problem.

Also, focusing on scaling before efficiency benefits nobody but the cloud provider.

>Yes, and making scaling frictionless brings a very tiny bit of value for everybody

I disagree. Using AWS in a frictionless way has made the difference between not deploying applications and deploying them. In one example, I used S3 and EC2 to deploy an app used by several thousand users at work - the deployment was completely scripted and tested before the old app was taken down. It eliminated errors in deploying, increased the frequency of deploying (which enabled faster security patches), reduced downtime from 6 hours to zero, and enabled new features for our users (due to scripted testing). Everyone won - and I got a promotion :)

AWS was originally built to run Amazon workloads. When building software at Amazon, scale absolutely is one of the first things you think about.
Heroku did so much right. I recently was toying with some bot frameworks (think Discord or IRC, nothing spammy or review-gaming) and getting everything set up on a free tier dyno with free managed SQL backing it up, and a GitHub test/build integration, all took an hour or so. Really exceeded my expectations.

Not sure how it scales for production loads but my experience was so positive I'll probably go back for future projects.

Yeah, Heroku is absolutely the best at just getting something running. Truth is most projects don't ever have to scale, either because they are hobby projects, or because they just fail. Heroku is the simplest platform that I know to just quickly test something. If you do find a good market fit and then need to scale, then sure, use some time to get out of it. But for proofs of concept, rapid iteration, etc. Heroku is awesome.
I’ll argue that Fly.io is beginning to meet that need in a lot of ways, especially with managed Postgres now.
In this case it is absolutely the user 'doing it wrong'.

AWS allows you to store gigantic amounts of data, thus lowering the bar dramatically for the kinds of things that we will keep.

This invariably creates a different kind of problem when those thresholds are met.

In this case, you have 'so much data you don't know what to do with it'.

Akin to having 'really cheap warehouse storage space' that just gets filled up.

"Now its complexity and scale make it unusable for newer devs."

No - the 'complexity' bit is a bit of a problem, but not the scale.

The 'complexity bit' can be overcome if you stick to some very basic things like running EC2 instances and very basic security configs. Beyond that, yes it's hard. But the 'equivalent' of having your own infra would be simply to have a bunch of EC2 instances on AWS and 'that's it' - and that's essentially achievable without much fuss. That's always an option for small companies, i.e. 'just run some instances' and don't touch anything else.

What do you see missing or not well explained in AWS documentation that newer devs wouldn't understand?

I started using S3 early in my career and didn't see this problem. I always thought about data retention during the design phase.

My opinion is that developers who are lazy, careless or under time pressure will not, and then will get bitten. But that would happen with any tool. Maybe a different problem, but they'll always get bitten hard ...

Forget newer devs for a moment... I've had years of experience with S3 and sounds like the author of the article has too. Despite my years of experience in programming/DBs/etc, I'm definitely not an amazing developer.

But I learned a whole lot of new things from this article that I didn't understand from reading the AWS documentation, let alone think I had to even concern myself with some of these issues. Spotty warnings about transitional request charges?

Anyway, kudos to you for always thinking about (and I hope actualizing) retention policies during the design phase. However, while I certainly think devs bear some of this responsibility, I'm sure they're usually met with all of the usual excuses and kick-the-can-down-the-road reasoning from PM/PO/etc that lead to these kinds of nightmares in the beginning... Then again, it will probably be another developer's or system admin's nightmare when it becomes an issue.

Even as an experienced engineer, I still struggle with setting the retention policy at the beginning of a new design... I'd love to hear any advice you have about how to manage this incredibly important aspect?

In my previous experiences, it really boils down to the unit economics.

If a given process generates $1 in revenue over a year, and it takes pennies for AWS services, that's a good sign your design is not going to break the company's pockets down the road.

In some cases, it's not easy to narrow the unit economics so much, which adds uncertainty to your premises, and there might be market fluctuations that change the unit economics in the future. I try to anticipate which areas are most likely to change and think of a trade-off in terms of short term speed and flexibility to change later, if needed. Almost always they're a trade off.
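
As a rough illustration of that unit-economics check, here is a minimal Python sketch. The function names and the 10% warning threshold are assumptions made up for illustration, not figures from any real project.

```python
def cost_share(revenue_per_unit: float, cloud_cost_per_unit: float) -> float:
    """Fraction of one unit's revenue that is consumed by cloud spend."""
    if revenue_per_unit <= 0:
        raise ValueError("revenue_per_unit must be positive")
    return cloud_cost_per_unit / revenue_per_unit


# Hypothetical threshold: flag designs where cloud spend exceeds 10% of revenue.
WARN_SHARE = 0.10


def looks_sustainable(revenue_per_unit: float,
                      cloud_cost_per_unit: float,
                      warn_share: float = WARN_SHARE) -> bool:
    """True when the cloud cost per unit stays within the chosen share of revenue."""
    return cost_share(revenue_per_unit, cloud_cost_per_unit) <= warn_share
```

With $1 of revenue and 3 cents of AWS cost the check passes; at 40 cents it does not. When the unit economics are uncertain, running the same check across a range of plausible costs shows how much headroom the design has.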

> we need a completely new experience for the next generation

I mean, at some point, if you're (say) using some insane amount of storage, you're going to pay for that.

I would agree that getting alerting right for billing-relevant events at whatever you're currently operating at should be a lot easier than it is. And I agree that there is a lot of room to baby-proof some of the less obvious mistakes that people frequently make, to better expose the consequences of some changes, etc.

But the flip side is that infra has always been expensive, and vendors have always been more than happy to sell you far more than you need along with the next new shiny whatever.

To the extent that these are becoming implicit decisions made by developers rather than periodic budgeted refresh events built by infra architects, developers need to take responsibility for understanding the implications of what they're doing.

My theory is that single-platform clouds actually make more sense than trying to be everything for everyone. While the latter can scale to $billions, the former might actually have higher margins because it delivers more value.

An example might be something like a Kubernetes-only cloud driven entirely by Git-ops. Not TFVC, or CVS, or Docker Swarm, or some hybrid of a proprietary cloud and K8s. Literally just a Git repo that materialises Helm charts onto fully managed K8s clusters. That's it.

If you try to do anything similar in, say, Azure, you'll discover that:

Their DevOps pipelines are managed by a completely separate product group and don't natively integrate into the platform.

You now have K8s labels and Azure tags.

You now have K8s logging and Azure logging.

You now have K8s namespaces and Azure resource groups.

You now have K8s IAM and Azure IAM.

You now have K8s storage and Azure disks.

Just that kind of duplication of concepts alone can take this one system's complexity to a level where it's impossible for a pure software development team to use without having a dedicated DevOps person!

Azure App Service or AWS Elastic Beanstalk are similarly overly complex, having to bend over backwards to support scenarios like "private network integration". Yeah, that's what developers want to do, carve up subnets and faff around with routing rules! /s

For example, if you deploy a pre-compiled web app to App Service, it'll... compile it again. For compatibility with a framework you aren't using! You need a poorly documented environment variable flag to work around this. There are a dozen or so more like this, and they rack up fast.

Developers just want a platform they can push code to and have it run with high availability and disaster recovery provided as-if turning on a tap.

> It's the user's fault for misusing the service.

I believe AWS's usage-based billing makes for long-tail surprises because its users are designing systems exactly as one would expect them to. For example, S3 is never meant for a bazillion small objects which Kinesis Firehose makes it easy to deliver to it. In such cases, dismal retrieval performance aside [0], the cost to list/delete dominates abnormally.

We spin up an AWS Batch job every day to coalesce all S3 files ingested that day into large zlib'd parquets (kind of a reverse VACUUM as in Postgres / MERGE as in Elasticsearch). This setup is painful. I guess the lesson here is, one needs to architect for both billing and scale, right from the get go.

[0] https://news.ycombinator.com/item?id=19475726
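
To make the coalescing idea concrete, here is a minimal Python sketch of just the planning step: grouping many small objects into merge batches of roughly a target size. It is not the job described above; the function name, the (key, size) input shape and the target size are assumptions for illustration, and reading the objects and writing Parquet are left out entirely.

```python
def plan_merge_batches(object_sizes, target_bytes):
    """Greedily group (key, size) pairs into batches of roughly target_bytes.

    A batch is closed as soon as adding the next object would push it past
    the target. An object larger than the target ends up in a batch by itself.
    """
    batches, current, current_size = [], [], 0
    for key, size in object_sizes:
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Hypothetical listing for one day: (object key, size in bytes).
listing = [("a", 40), ("b", 40), ("c", 40), ("d", 100), ("e", 10)]
batches = plan_merge_batches(listing, target_bytes=100)
```

Each resulting batch would become one large output file, so the number of objects that later need to be listed, read or deleted drops from the number of inputs to the number of batches.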

Perhaps I don't fully understand the nuances of what you're trying to do, but...

> S3 is never meant for a bazillion small objects which Kinesis Firehose makes it easy to deliver to it

Are you saying Firehose increases the likelihood of creating the "small file problem"?

If so, isn't this exactly what Firehose tries to prevent? Sure, you can set all the thresholds low and unnecessarily generate lots of small files, but you can tune those thresholds to maximize record/file size and attain a reasonable latency. If there's a daily batch job to make this data useful, then who cares about latency?

Also, why would you run a daily batch job to coalesce all these files into parquet files instead of letting Firehose just do that for you. It can also do a certain amount of partitioning if it's required.

> Are you saying Firehose increases the likelihood of creating the "small file problem"?

Firehose makes it easy to do so (when the thresholds are too low, as you point out). That is, it'd happily chug along and do what you ask of it to. Sometimes, these problems only manifest in the long run (kind of like a frog in boiling water).

> Also, why would you run a daily batch job to coalesce all these files into parquet files instead of letting Firehose just do that for you.

Firehose recommends that the output be at least 64M to 128M for parquet files... we don't have anywhere near that much data to yeet out of Firehose, especially because data is partitioned per-user (and a single user doesn't generate anywhere near that much data, and so we're left with the current setup). And so: it was either let Firehose batch the data up in larger parquets (and run the partitioning job offline), or employ its partitioning magic online (and run the merge job offline, on-demand). We chose the latter for cost efficiency given our workloads.
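
A back-of-the-envelope model shows why per-user partitioning leads to small files even with generous thresholds. This is a simplified sketch, not Firehose's actual behaviour or API: it assumes every partition gets an equal share of the stream and flushes as soon as either a size or a time threshold is hit. The function name and all input values are illustrative assumptions.

```python
import math

SECONDS_PER_DAY = 86_400


def objects_per_day(bytes_per_second, buffer_bytes, buffer_seconds, partitions=1):
    """Estimate objects written per day by a size-or-time buffered writer.

    Simplified model (an assumption): traffic is spread evenly across
    partitions, and each partition flushes when its buffer reaches
    buffer_bytes or when buffer_seconds have elapsed, whichever is first.
    """
    per_partition_rate = bytes_per_second / partitions
    seconds_to_fill = buffer_bytes / per_partition_rate
    flush_period = min(buffer_seconds, seconds_to_fill)
    return partitions * math.ceil(SECONDS_PER_DAY / flush_period)


# Illustrative inputs: a 1 MB/s stream, 128 MiB size threshold, 900 s time threshold.
single_partition = objects_per_day(1_000_000, 128 * 1024 * 1024, 900)
per_user = objects_per_day(1_000_000, 128 * 1024 * 1024, 900, partitions=10_000)
```

Under these assumptions, one partition fills its buffer in a couple of minutes and writes a few hundred large objects a day. Split across 10,000 partitions, every buffer flushes on the timer instead, and the same stream produces hundreds of thousands of objects of about 90 KB each.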

HackerNews loves to criticize the cloud. It always reminds me of this infamous Dropbox comment: https://news.ycombinator.com/item?id=9224

The cloud abstracts SO MUCH complexity from the user. The fact that people are then gleefully taking these "simple" services and overloading them with way too much data, and way too much complexity on top is not a failure of the underlying primitives, but a success.

Without these cloud primitives, the people footgunning themselves with massive bills would just not have a working solution AT ALL.

> The people footgunning themselves with massive bills would just not have a working solution AT ALL.

Sometimes guard rails are a good thing, and the AWS philosophy has very firmly been against guard rails, especially related to spending. The issue has come up here again and again that AWS refuses to add cost limits, even though they are capable of it. Azure copied this limitation. I don't mean that they didn't implement cost limits. They did! The Visual Studio subscriber accounts have cost limits. I mean that they refused to allow anyone to use this feature in PayG accounts.

Let me give you a practical example: If I host a blog on some piece of tin with a wire coming out of it, my $/month is not just predictable, but constant. There's a cap on the outbound bandwidth, and a cap on compute spending. If my blog goes viral, it'll slow down to molasses, but my bank account will remain unmolested. If a DDoS hits it, it'll go down... and then come back up when the script kiddies get bored and move on.

Hosting something like this on even the most efficient cloud-native architecture possible, such as a static site on an S3 bucket or Azure Storage Account is wildly dangerous. There is literally nothing I can do to stop the haemorrhaging if the site goes popular.

Oh... set up some triggers or something... you're about to say, right? The billing portal has a multi-day delay on it! You can bleed $10K per hour and not have a clue that this is going on.

And even if you know... then what? There's no "off" button! Seriously, try looking for "stop" buttons on anything that's not a VM in the public cloud. S3 buckets and Storage Accounts certainly don't have anything like that. At best, you could implement a firewall rule or something, but each and every service has a unique and special way of implementing a "stop bleeding!" button.

I don't have time for this, and I can't wear the risk.

This is why the cloud -- as it is right now -- is just too dangerous for most people. The abstractions it provides aren't just leaky, the holes have razor-sharp edges that have cut the hands of many people who think that it works just like on-prem but simpler.

> There is literally nothing I can do to stop the haemorrhaging if the site goes popular.

There’s WAF with rate-based limiting to prevent script kiddies from randomly hitting your URLs for files to download and running up your egress costs. WAF costs $5/month plus a flat fee per extra rule.
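
For anyone unfamiliar with the idea, rate-based limiting boils down to counting requests per client over a time window. Below is a minimal in-memory sketch in Python. It illustrates the general technique only, not how AWS WAF is implemented or configured; the class name, limit and window are hypothetical.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within `window_seconds`."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.hits = defaultdict(deque)  # client id -> timestamps of allowed requests

    def allow(self, client_ip, now):
        """Return True if the request at time `now` (seconds) may proceed."""
        recent = self.hits[client_ip]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False
        recent.append(now)
        return True


# Hypothetical policy: 3 requests per client per 60 seconds.
limiter = SlidingWindowLimiter(limit=3, window_seconds=60)
burst = [limiter.allow("203.0.113.7", t) for t in (0, 1, 2, 3)]
```

The fourth request in the burst is rejected, while other clients and later requests from the same client are unaffected. Rejected requests never reach the origin, which is what keeps the egress bill from growing with them.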

For DDoS protection there’s Shield, which is built into CloudFront and should be enough for most people, but if you need more control they have Shield Advanced.

The “Stop Button” for S3 is an application-layer responsibility, IMHO, though S3 should make cleanup easier.

Awesome. So I should spend more money to protect myself from flaws in Amazon's billing model with a service that I don't need for static file serving.

This kind of "blame the user" thinking is why I avoid the cloud for my own use, and can't recommend it for most customers unless they have a specific reason.

Think of it the other way. Instead of it being "it costs more to have the safety features", it's "it costs less if you don't need the safety features".

If you want to spend the absolute bare minimum price, you get the bare minimum service.

> blame the user

Not sure how this is blame the user. If you are setting up a bare metal server for a client and they don't ask you for (say) DDOS protection, will you still set up a DDOS protection protocol for them? I would think not since most people would try to match what a client asks for and maybe throw in some freebies.

If, after that, they get hit by a DDoS, the onus was on them to have told you to plan ahead for it, and recognizing this is not 'blame the user'.

This is exactly what AWS is also offering - a basic setup and extra bells and whistles to protect yourself from possible issues based on your threat model.

Maybe I'm missing something in your response.

There are two kinds of outcomes from a DDoS:

1. an outage, which in reality is just an inconvenience, not the end of the world, unlike what most IT people seem to think.

2. a bill that can bankrupt you, which may as well be the end of the world for many people or small businesses. It can be literally "game over".

A bare metal box doesn't need protection from the 2nd risk. Its costs are fixed, irrespective of the amount of traffic attempting to hit it. A 100 Mbps link can't put out more than 100 Mbps, so even if you're charged by the terabyte of egress, there's a cost ceiling integrated into the hardware itself.
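
That ceiling is easy to put a number on. Here is a minimal Python sketch of the arithmetic; the function names and the 30-day month are assumptions for illustration, and the per-terabyte price is a hypothetical parameter you would fill in from your own plan.

```python
def max_egress_terabytes(link_mbps: float, days: int = 30) -> float:
    """Most data a link can physically send in `days`, in decimal terabytes."""
    bits_per_second = link_mbps * 1_000_000
    total_bytes = bits_per_second / 8 * days * 86_400
    return total_bytes / 1e12


def worst_case_egress_bill(link_mbps: float, price_per_tb: float, days: int = 30) -> float:
    """Upper bound on a per-terabyte egress bill for a saturated link."""
    return max_egress_terabytes(link_mbps, days) * price_per_tb


ceiling_tb = max_egress_terabytes(100)  # a 100 Mbps link, saturated for 30 days
```

A fully saturated 100 Mbps link moves about 32.4 TB in a 30-day month, so whatever the per-terabyte price is, the worst-case bill is that price times 32.4 and no more.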

The cloud generally has no such limits, or much, much higher ones than is typically desirable.

Okay, here's another random example your WAF will not protect you from: cloud-hosted DNS.

The bare metal scenario is a box sitting on the end of a 1 Gbps Ethernet link. If attacked by some crazy UDP DNS flood attack, it could probably saturate that pipe and send out... 1 Gbps. On a fixed-cost-link plan this costs $0.00 additional money. You might have an outage, or merely a brown-out, but you won't see a cent added to your next bill.

On Azure's DNS Zones service, there's no "1 Gbps" pipe to rate limit them. They have infrastructure deployed globally, typically with 100 Gbps links. In practice, the DNS server probably only gets about 10 Gbps per region, but there's many regions. At 100 bytes per packet, you could be looking at a billion requests per second billed to your account, at an eyewatering $200/s or $720K/hour. Ouch!
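
The arithmetic behind those figures can be checked with a short Python sketch. The inputs are the numbers from the paragraph above; the per-million-query price is not quoted anywhere in this thread and is only what the $200/s figure implies, so treat it as derived rather than as a published rate.

```python
def requests_per_second(link_gbps: float, bytes_per_packet: int) -> float:
    """Packets per second that fit through a link at the given packet size."""
    return link_gbps * 1e9 / (bytes_per_packet * 8)


per_region_rps = requests_per_second(10, 100)    # one region at 10 Gbps
regions_for_a_billion = 1e9 / per_region_rps     # regions needed for 1e9 requests/s
dollars_per_hour = 200 * 3600                    # $200/s sustained for an hour
implied_price_per_million = 200 / (1e9 / 1e6)    # dollars per million queries
```

So a billion requests per second corresponds to roughly 80 regions each absorbing 10 Gbps; a single region at that rate is closer to 12.5 million requests per second. The $720K/hour figure follows directly from $200/s, which in turn implies about $0.20 per million queries.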

Now, Azure will probably forgive that bill because it's clearly an attack.

But what if it isn't clearly an attack? Application Insights by design puts the Instrumentation Key into client-side JavaScript. It charges $3/GB on ingress! It's trivial to charge someone thousands or tens of thousands of dollars before they notice, and then they'd have a hard time convincing support that the traffic wasn't legitimate.

I can send out a terabyte for cents, and each terabyte would cost some poor fool $3,000.

Good luck plugging every such hole, monitoring every alert (there's literally tens of thousands of metrics to alert on), and keeping up with every spike in billing that's a day late reporting on costs that can ramp up to thousands of dollars per minute.

What does an automatic cost limit look like when you have metered storage services? Start killing customer data?
A surprise massive bill can be worse than no solution at all, in my opinion. And if the easy path leads to massive-bill lock-in, that's also not very helpful. It's not like people didn't know how to run servers and remote storage before AWS showed up. Before AWS showed up at least your data-center costs were pretty predictable: you managed the servers yourself, so whatever the salaries summed to, that was it. It's not like poor programming ate up your year's IT budget by May.
But you did waste money through capacity planning, as you either have the exact quantity of servers needed, or you have some sitting there idle. Then there’s swapping out older/failing hardware. In fact your DC calculations are rather difficult to get accurate.
I agree 110%.

Actually, I disagree with one statement: "AWS was built for a specific purpose and demographic of user". AWS wasn't built for anyone. It was built for everyone, and is thus not even reasonably productive for anyone. AWS's entire product development methodology is "customer asks for this, build it"; there's no high level design, very few opinions, five different services can be deployed to do the same thing, it's absolute madness and getting worse every year. Azure's methodology is "copy whatever AWS is doing" (source: engineers inside Azure), so they inherit the same issues, which makes sense for Microsoft because they've always been an organization gone mad.

If there's one guiding light for Big Cloud, it's: they're built to be sold to people who buy cloud resources. I don't even feel this is entirely accurate, given that this demographic of purchaser should at least, if nothing else, be considerate of the cost, and there's zero chance of Big Cloud winning that comparison without deceit, but if there was a demographic that's who it'd be.

> I'd argue, we need a completely new experience for the next generation.

Fortunately, the world is not all Big Cloud. The work Cloudflare is doing between Workers & Pages represents a really cool and productive application environment. Netlify is another. Products like Supabase do a cool job of vendoring open source tech with traditional SaaS ease-of-use, with fair billing. DigitalOcean is also becoming big in the "easy cloud" space, between Apps, their hosted databases, etc. Heroku still exists (though I feel they've done a very poor job of innovating recently, especially in the cost department).

The challenge really isn't in the lack of next-gen PaaS-like platforms; it's in countering the hypnosis puked out by Big Cloud Sales in that they're the only "secure" "reliable" "whatever" option. This hypnosis has infected tons of otherwise very smart leaders. You ask these people "let's say we are at four nines now; how much are you willing to pay, per month, to reach five nines? And remember Jim, four nines is about one hour of downtime a year." No one can answer that. No one.
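
For reference, the downtime budget behind each "nine" is simple arithmetic. A minimal Python sketch, assuming a 365-day year (the function name is just for illustration):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of `nines` nines.

    Example: nines=4 means 99.99% availability, i.e. 0.01% unavailability.
    """
    unavailability = 10 ** -nines
    return unavailability * MINUTES_PER_YEAR


four = downtime_minutes_per_year(4)  # about 52.6 minutes
five = downtime_minutes_per_year(5)  # about 5.3 minutes
```

Going from four nines to five buys back roughly 47 minutes a year, which is the number to weigh against whatever the extra nine costs per month.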

End point being: anyone who thinks Big Cloud will reign supreme forever hasn't studied history. Enterprise contracts make it impossible for them to clean the cobwebs from their closets. They will eventually become the next Oracle or IBM, and the cycle repeats. It's not an argument to always run your own infra or whatever; but it is an argument to lean on and support open source.

> Azure's methodology is "copy whatever AWS is doing" (source: engineers inside Azure), so they inherit the same issues, which makes sense for Microsoft because they've always been an organization gone mad.

I guessed this, but it's funny to see it confirmed.

I got suspicious when I realised Azure has many of the same bugs and limitations as AWS despite being supposedly completely different / independent.

That’s just it, though: it isn’t an AWS horror story. It’s the sorcerer’s apprentice.