| HN Mirror



	by jacobra2 2269 days ago
	What was the mistake?

4 comments

appstorelottery 2269 days ago

Running up a shitload of instances for testing and leaving all of them running overnight. Each of these instances continually rendered 4k video data to storage. This kind of test was supposed to be 1000x smaller, running for at most 10-20 seconds at a time. He had written his own provisioning system which - according to his report - failed to properly manage instances in a "weird" edge case. No kidding.

appstorelottery 2269 days ago

Every morning I would check AWS billing just out of habit. I'm just thankful I did - otherwise everything would have kept running...

The lesson for me was don't trust your internally-hacked-together instance management system. The AWS interface to storage and instances is the base truth. And perhaps more importantly - I'm never getting into another startup which has financial risk like that without being a core expert in that risk/tech. I was focused on the business + client code - and had very little clue about the nitty-gritty of AWS. I should have been more involved with the code on that side, or at least the data-flow architecture.

epiphanitus 2269 days ago

SRE here. I feel for your situation. Here's some advice. One simple thing you could do is set up AWS billing alarms and have them delivered to a notification app like PagerDuty.

https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitori....

If you don't want to pay for PD, you can patch together any number of ways to get your phone to scream and holler when it gets an email from ohshit@amazonasws.com. It's also good to have clear expectations as to whose responsibility it is to deal with problem x between the hours of y and z and exactly what they are supposed to do.

Keep the alerts restricted to the really important stuff, because if your team becomes overloaded with useless alerts they will 1) dislike you and 2) be more prone to accidentally mistaking a five alarm fire for a burnt casserole.

There are more complex systems you could build, but that's a start.
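A minimal sketch of the billing alarm suggested above, assuming boto3 is installed, billing alerts are enabled on the account, and an SNS topic (which PagerDuty or email can subscribe to) already exists. The alarm name, the six-hour period, and both function names are illustrative assumptions, not anything from the thread.

```python
from typing import Any


def billing_alarm_params(threshold_usd: float, sns_topic_arn: str) -> dict[str, Any]:
    """Build keyword arguments for CloudWatch's PutMetricAlarm call."""
    return {
        # Hypothetical alarm name, chosen only for this example.
        "AlarmName": f"estimated-charges-over-{int(threshold_usd)}-usd",
        "Namespace": "AWS/Billing",
        "MetricName": "EstimatedCharges",
        "Dimensions": [{"Name": "Currency", "Value": "USD"}],
        "Statistic": "Maximum",
        "Period": 21600,  # 6 hours (assumed); billing data is not real-time
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_usd),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # notify this SNS topic on ALARM
    }


def create_billing_alarm(threshold_usd: float, sns_topic_arn: str) -> None:
    """Create the alarm. Needs boto3 and AWS credentials at call time."""
    import boto3  # imported lazily so the builder above works without boto3

    # Billing metrics are published in us-east-1, so the alarm lives there.
    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
    cloudwatch.put_metric_alarm(**billing_alarm_params(threshold_usd, sns_topic_arn))
```

Splitting the parameter builder from the API call keeps the threshold logic checkable without touching AWS. The alarm only notifies; it does not stop anything.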

HenryBemis 2269 days ago

Thank you for this. How can anyone run ANY service with ANY company and not add a clause in the contract (and then have the alerts up and running) for controlling costs?

I remember PagerDuty was advertising (a lot) on Leo Laporte's podcasts a few years back.

A clause in the contract: if monthly bill reaches $Xk amount then:

(a) seek written approval by client, and

(b) continue until $Yk or approval is given with a new ceiling price.

MattSayar 2268 days ago

I was just playing around with AWS a while ago and was surprised that I could not find any option to put a cap on the amount I'd spend in a month. Only thing I could do was set up alerts.

I imagine AWS would have 0 problems suspending all my services if I can't pay, so why can't it do the same thing when it reaches my arbitrary cap?

lowercased 2269 days ago

> I'm never getting into another startup which has financial risk like that without being a core expert in that risk/tech

This may be something that is 'unstated', but unless you actually had access to fix something that was wrong, as well, being an expert in that wouldn't really help all that much. I've been in situations where I have explicit/expert knowledge of XYZ, but when the people responsible for XYZ do not take your input, and/or don't provide you the ability to fix a problem, expert knowledge is useless (or worse, it's like having to watch a train wreck happen when you know you could have stopped it).

Aeolun 2269 days ago

This. But on the other hand, you can be ready with the popcorn when shit eventually does hit the fan.

bluecmd 2269 days ago

And then have to live with asking yourself "could I have done more?"

dmos62 2268 days ago

As in beer and crisps? /s

amiga 2268 days ago

"...could I have saved the day if I were willing to loudly complain until someone listened?"

justinclift 2269 days ago

On the other hand, it sounds like you hired someone who wasn't really up for the level of responsibility given. :(

In theory ;), you shouldn't have to be a core expert in everything. But yeah... in the real world, things aren't so cut and dry. :/

nitely 2269 days ago

TBH, the real problem is AWS bills cannot be capped in any way (you can set up an alarm, though). It's unreasonable to expect a programmer won't make mistakes.

manigandham 2269 days ago

Of course they can be capped, you just turn off the services. If you're asking them to automate that for you, then the counterpoint would be people accidentally setting a budget that wipes out their resources and complaining about that.

Easier for both sides to just ask AWS for a refund if there's a reasonable case.

nicoburns 2269 days ago

> the counterpoint would be people accidentally setting a budget that wipes out their resources and complaining about that.

This wouldn't be an issue if it was configurable.

dragonwriter 2269 days ago

> Of course they can be capped, you just turn off the services.

That's not a hard cap, since turning off services isn't instant and costs continue to accrue. But, yes, there are ways to mitigate the risk of uncapped costs and they are subject to automation.

ngcc_hk 2269 days ago

There should be a cap so you have a check. If your system does not allow a threshold or assertion, please do not use it. If your cloud system does not have a capped budget that you can play within and that alerts you when you are about to run out, do not use it.

EpicEng 2269 days ago

>In theory ;), you shouldn't have to be a core expert in everything. But yeah... in the real world, things aren't so cut and dry. :/

Right. In my experience, if you don't understand what's going on beneath your abstractions, you're always in for a world of hurt as soon as something goes sideways.

glenngillen 2269 days ago

Did you reach out to AWS support or your account manager? They’d definitely have worked something out.

shawabawa3 2269 days ago

Did you contact AWS and let them know it was a mistake?

They have a good track record of cancelling huge bills the first time they happen

seibelj 2269 days ago

Assuming you were incorporated and had a business account - declare bankruptcy and the bill goes away. I don’t understand why you would still pay the bill if you were going out of business anyway.

appstorelottery 2269 days ago

Why didn't I file bankruptcy? This happened in Australia and declaring bankruptcy was not the right thing to do - for many reasons, not the least of which it makes it much harder to operate as a director of a previously bankrupt company, but in the worst case my bank would have just gone after me as I'd given a personal guarantee.

garmaine 2269 days ago

There is no concept of limited liability in Australia?

jkaplowitz 2269 days ago

Even in the United States, most small business loans require personal guarantees which narrowly override the corporate limited liability to make that guarantor liable for that debt if the company doesn't pay. There are some rare exceptions, and possibly more for startups funded by big-name VCs, but I don't know.

hnick 2269 days ago

If a director becomes personally bankrupt (such as trying to be the good guy and using personal guarantees to take on company debts in an effort to scrape through) then they're banned from running a company until it clears. If they're the director of a company that goes bankrupt, I believe they get 2 chances (companies) before there's a chance of being banned from running more for a time.

Either way it might be nice to keep your options open, depending on your plans.

scarface74 2269 days ago

Or you could just send an email to support and ask them to waive the charges.

lostlogin 2269 days ago

If that got to the right person on the right day and they knew it was going to kill the company, it seems likely to help. And combined with the fact that it would probably guarantee future revenue way off into the future...

scarface74 2269 days ago

I have never heard of a case where they wouldn’t give refunds. AWS is competing with the 95% of compute that is not running in the cloud (their own statistics). The last thing they want is a reputation that one mistake will bankrupt a business.

aledalgrande 2269 days ago

Once I got something like a year of EC2 charges retroactively reimbursed for a few instances I hadn't used.

staticassertion 2269 days ago

I've repeatedly seen requests of this nature handled by AWS - 75% cuts to billing, 90% cuts even.

aojdoiasjdasd 2269 days ago

This. I work at Amazon and this is more common than you'd expect. "Customer obsession" and all that.

hnick 2269 days ago

I'm not the type to 'want to speak to the manager' for my self-imposed problems but the more I hear about people coming out ahead the more I think I need to change my ways.

cowsandmilk 2269 days ago

Yep, and an opportunity to educate on things like budgets and billing alarms to try to prevent this in the future.

teddyuk 2269 days ago

Yeah, every time I’ve heard this story support have always fixed it, at least the first time per account

101404 2269 days ago

AWS should have a cost cap. Set a max spend value and shut down all servers if you spent it.

dragonwriter 2269 days ago

> AWS should have a cost cap. Set a max spend value and shut down all servers if you spent it.

That might make sense for some particular services (e.g., capping the cost on active EC2 instances) but lots of AWS costs are data storage costs, and you probably don't want all your data deleted because you ran too many EC2 instances and hit your budget cap.

Where exactly you are willing to shut off to avoid excess spend and what you don't want to sacrifice automatically varies from customer to customer, so there's no good one-size-fits-all automated solution.

JamesBarney 2269 days ago

I think if resources had an option of "At cap: Do nothing, Shut down, shutdown and erase data" that would cover most of the use cases.

jfkebwjsbx 2268 days ago

Keeping the data for a week but completely inaccessible would not be a huge cost for AWS yet a big relief for startups.

samstave 2268 days ago

We used to have a bunch of billing graphs in Stackdriver with alerting thresholds to PagerDuty to capture exactly situations like this.

tomerico 2269 days ago

Why is there no way to set a limit on billing on AWS? Especially for cases like this, where killing testing instances does not have a dramatic negative effect...

strongbond 2269 days ago

Agreed. The simple solution is an expenditure cap. Why can't Amazon implement one? The fear of it going wrong like this would make me keep away from AWS forever.

ehsankia 2269 days ago

Wait, is there really not one on AWS? I thought this was the #1 most important feature on any such cloud systems.

It's the very very first thing I set when setting up my GCloud hobby project. I was like, this is fun and all, but I don't care about this enough so I limited it to 3$ per day and 50$ per month. If it goes above, I'm very happy to let it die, and it also gives me a warning so I know something is up. The 2 times it triggered, there was something I managed to fix so the tool is still up and running costing pennies.

Itsdijital 2269 days ago

I got pegged to the wall by aws once on a hobby project. $1500 racked up in two months. Apparently I left a snapshot in some kind of instant restore state to the tune of $0.75/hr. I used the instance for 2 days, and then shut everything down. Or at least thought I did.

The account I did it on was tied to my "junk" email, so I didn't catch amazon banging on my door saying my payment info needed to be updated. Well until I did happen upon one of the emails. Nearly had a heart attack.

Talked to aws support and they fully refunded me. Very very kind of them, but now I'm terrified to touch anything aws.

ufmace 2269 days ago

I don't think an expenditure cap is so simple. Exactly what happens when you hit it? If you have, let's say, 3 RDS DBs and 20 EC2 instances running and a bunch of stuff in S3 and a few dozen SQS queues and a few DynamoDB tables etc, and your account goes over the limit, how do you decide which service you want to automatically cut?

JamesBarney 2269 days ago

So 90% of the time I hear these horror stories it's a test/dev account where deleting everything is preferable to getting a bill.

I also don't understand why everyone is assuming

"if I hit threshold X do A, if I hit threshold Y do B" where A and B are some combination of shutting down and deleting resources,

is as difficult as solving an NP-complete problem.
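The "threshold X do A, threshold Y do B" idea can be sketched as a small lookup. This is only an illustration of the decision logic: `Rule`, `actions_due`, and the action names are hypothetical, and no such per-account policy feature is claimed to exist in AWS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    threshold_usd: float
    action: str  # hypothetical action label, e.g. "notify" or "stop_instances"


def actions_due(spend_usd: float, rules: list[Rule]) -> list[str]:
    """Return the action of every rule whose threshold has been crossed,
    ordered from the lowest threshold to the highest."""
    crossed = [r for r in rules if spend_usd >= r.threshold_usd]
    return [r.action for r in sorted(crossed, key=lambda r: r.threshold_usd)]


# Example policy for a test/dev account (all values are made up).
rules = [
    Rule(10_000, "delete_test_resources"),
    Rule(1_000, "notify"),
    Rule(5_000, "stop_instances"),
]
```

The hard part raised elsewhere in the thread is not this lookup but choosing what each action may safely destroy, which is why the actions here are just labels.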

jimmaswell 2269 days ago

> Why can't Amazon implement one?

Greed, I'm assuming.

joshvm 2269 days ago

Nowadays quotas give you some safety net. For example you usually have to request more than one GPU to avoid burning money that way, or more than say 32 instances. It should not be possible for a new account to spawn 1k VMs overnight.

The problem with billing is that often these charges are not calculated instantly, and others are not trivial to deal with. For example what happens if you go over budget on bandwidth or bucket storage, but still within quota? What do you kill? Do you immediately shut down everything? Do you lose data? There are lots of edge cases.

You can normally write your own hooks to monitor billing alerts and take action appropriately.
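One way such a hook might look, assuming the billing alarm publishes to an SNS topic that triggers a Lambda-style handler. The function name and the injected `stop_test_instances` callable are hypothetical; the event shape (`Records[].Sns.Message` holding a JSON string with `NewStateValue`) follows the usual CloudWatch-alarm-over-SNS format.

```python
import json
from typing import Callable


def handle_billing_alert(
    event: dict,
    stop_test_instances: Callable[[], list[str]],
) -> list[str]:
    """React to CloudWatch alarm notifications delivered through SNS.

    `stop_test_instances` is supplied by the caller; it should stop whatever
    is safe to stop (for example, instances tagged as test) and return the
    IDs it acted on.
    """
    stopped: list[str] = []
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        # Only act when the alarm fires, not when it returns to OK.
        if message.get("NewStateValue") == "ALARM":
            stopped.extend(stop_test_instances())
    return stopped
```

Because billing data lags, a hook like this limits damage rather than enforcing an exact cap.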

zknz 2269 days ago

There are service limits on new accounts per region - 20 EC2 instances. These require a ticket lodged to over-ride.

nicoburns 2269 days ago

You can still burn an awful lot of money with 20 EC2 instances.

grecy 2269 days ago

... put a credit card on the account that only has a $1000 limit. Or better yet, a prepaid one.

monktastic1 2269 days ago

In this case wouldn't it just cause Amazon to send you a notice that the $10k overnight charge was declined and you should enter another payment method?

user5994461 2269 days ago

How many is a shitload of instances? Are we talking tens, hundreds, thousands?

In my experience AWS had very stringent limits on the amount of active instances of each type (starts around 10 for new accounts, 2 for the more expensive instances). It takes tickets to support then days of waiting to raise these limits.

That should have prevented your company from creating tens of instances, let alone hundreds, unless that's already your typical daily usage.

redis_mlc 2269 days ago

There used to be no limit on EC2 instances.

yingw787 2269 days ago

Holy crap dude, that's some nightmare shit right there.

Does AWS update the billing console per day or upon request? I get charged per month, but I should add a habit in my habit tracker to learn more about my expenses...

freeone3000 2269 days ago

Hourly. You can also set up billing alerts, which will email you.

WrtCdEvrydy 2269 days ago

Be aware that some services bill asynchronously so it can take 24 hours in some instances.

lostlogin 2269 days ago

This is what was needed.

x86_64Ubuntu 2269 days ago

What's the technical process to ensure that this never happens? Nowadays, having to have someone "watch" the test and then kill the instances is manual labor which is a no-no. So how do you make it so that your test fires up the instances, and then kills them when the test is done?

jrockway 2269 days ago

I think you have to have an upper bound set with AWS that kills stuff when you have reached the amount of money you want to spend. But of course, people would whine about that. "How AWS killed my business on the busiest day of the year," would probably be the article title.

JamesBarney 2269 days ago

But I have far more sympathy for "I made an AWS mistake and got hit with a 100k bill" than "I told AWS to turn off my ec2 instances at 10k, and then at 10k it turned off my ec2 instances"

catlifeonmars 2269 days ago

There are many ways to solve this problem. One way to do this is to model your test infrastructure in CloudFormation. You can then use an SSM Automation Document to manage the lifecycle of your test. Putting all your infrastructure in CloudFormation allows you to clean up all of the test resources in a single DeleteStack API call, and the SSM Document provides: (1) configurable timeout and cleanup action when the test is done, (2) auditing of actions taken, and (3) repeatability of testing.
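The "one call tears everything down" part of this can be sketched without SSM as a context manager that always runs the delete step, even if the test raises. This is a simplified stand-in, not the SSM Automation approach described above; `ephemeral_stack` and its two callables are hypothetical. With boto3, they could wrap the CloudFormation client's `create_stack` and `delete_stack`.

```python
import contextlib
from typing import Callable, Iterator


@contextlib.contextmanager
def ephemeral_stack(
    create: Callable[[], str],
    delete: Callable[[str], None],
) -> Iterator[str]:
    """Create test infrastructure, hand its ID to the test body, and
    delete it afterwards whether the body succeeds or fails."""
    stack_id = create()
    try:
        yield stack_id
    finally:
        # Runs on success, on exceptions, and on KeyboardInterrupt.
        delete(stack_id)
```

This does not cover the case where the machine running the test dies mid-run, which is where a server-side timeout earns its keep.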

Twirrim 2269 days ago

Not sure if this would help in this particular scenario, but unit and integration testing of operations scripts can save a lot of pain, anguish and $$s too.

It's horrifying how many places treat writing tests for services as critical, but then completely fail to write tests for their operational tooling. Including tools responsible for scaling up and down infrastructure, deleting objects etc.

dirtydroog 2269 days ago

But if a test fails does it now mean you're bankrupt?

Twirrim 2269 days ago

Could do? Not sure what your point is here.

rcxdude 2269 days ago

You can do timed instances, and/or make the instances have a timed job to shut down after a fixed time (which is what I use to shut down an instance which only gets spooled up for occasional CI jobs after an hour).

bluecmd 2269 days ago

+1. When I had to use AWS for batch workloads, which at the time at least didn't have a TTL attribute on VMs, I made sure that the VM first scheduled a shutdown in like 30 min if the test was supposed to only run in 10 min.
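A minimal sketch of the "schedule the shutdown first" trick, written as a helper that renders an instance boot script. It assumes a Linux image where `shutdown -h +N` schedules a halt and returns immediately (true on systemd-based images); the helper name and the test command are hypothetical.

```python
def self_destruct_user_data(minutes: int, command: str) -> str:
    """Render a boot script that schedules a halt before doing any work."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return "\n".join(
        [
            "#!/bin/bash",
            # Schedule the halt first, so it still fires if the job hangs.
            f"shutdown -h +{minutes}",
            command,
            "",  # trailing newline
        ]
    )
```

When launching through the EC2 API, setting `InstanceInitiatedShutdownBehavior` to `terminate` makes that halt terminate the instance rather than just stop it, so the instance stops accruing compute charges; whether attached volumes are deleted depends on their settings.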

dev_throw 2269 days ago

You can use auto scaling groups with a load balancer to terminate instances when not in use and spin them up as required.

WrtCdEvrydy 2269 days ago

This is why it's Terraform or nothing for me.

scubbo 2269 days ago

I'd be fascinated to hear how Terraform would have intelligently known that those instances were not meant to stay on overnight.

WrtCdEvrydy 2269 days ago

Honestly, I'd create the instances using an ASG, then set the ASG size to 0 (or throw inside a while loop until any errors go away). Always create instances from an AMI and always put them in an ASG (even if the ASG only has 1 item min, target, and max on it).
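Setting an ASG to size 0 might look like the following, assuming boto3 and an existing Auto Scaling group. The function names are illustrative; the parameters are those of the Auto Scaling `UpdateAutoScalingGroup` call.

```python
def scale_to_zero_params(group_name: str) -> dict:
    """Keyword arguments that shrink an Auto Scaling group to no instances."""
    return {
        "AutoScalingGroupName": group_name,
        "MinSize": 0,
        "MaxSize": 0,  # also blocks scale-out until someone raises it again
        "DesiredCapacity": 0,
    }


def scale_to_zero(group_name: str) -> None:
    """Apply the change. Needs boto3 and AWS credentials at call time."""
    import boto3  # imported lazily so the builder above works without boto3

    autoscaling = boto3.client("autoscaling")
    autoscaling.update_auto_scaling_group(**scale_to_zero_params(group_name))
```

The call only requests the change; instances terminate afterwards, so it is still worth confirming the group is actually empty.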

jenkinstrigger 2268 days ago

I love Terraform and ASGs but that still doesn't solve the fact that their SRE overprovisioned. They might have even used both things!

smiths1999 2269 days ago

This has happened to me several times, albeit at a much smaller scale. I fire up a few GPU instances for training neural networks and when I go to shut the instances down I forget that you always need to refresh the instance page before telling AWS to stop my instances. I still go through all the confirmations saying I do, indeed, want to stop all instances. However, these few times I forgot to refresh to make sure they actually were shutting down and simply went to bed. Not an $80k mistake, but certainly a couple hundred dollars, which hurts as a grad student.

Now I have learned, _always_ refresh the page and instance list prior to shutting anything down and _always_ confirm the shutdown was successful.
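The "always confirm the shutdown" habit can be scripted. Below is a sketch of a polling check, with the state lookup injected so it runs without AWS; `wait_until_stopped` and `get_states` are hypothetical names. In practice `get_states` could wrap EC2's `describe_instances`, and boto3 also ships an `instance_stopped` waiter that does a similar job.

```python
import time
from typing import Callable


def wait_until_stopped(
    get_states: Callable[[], dict[str, str]],
    timeout_s: float = 300,
    poll_s: float = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll until every instance reports 'stopped' or 'terminated'.

    `get_states` returns {instance_id: state_name}. Raises TimeoutError
    naming the instances that are still up when the deadline passes.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        states = get_states()
        still_up = sorted(
            iid for iid, state in states.items()
            if state not in ("stopped", "terminated")
        )
        if not still_up:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"instances not stopped: {still_up}")
        sleep(poll_s)
```

Failing loudly on timeout is the point: a silent "stop requested" is exactly what led to the overnight bill described above.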

BluePen7 2269 days ago

Not who you asked, but my mistake was transferring an S3 bucket full of unused old customer web assets to glacier, we were paying a lot to host them each month, and weren't using them anymore.

I set the lifecycle rule on all objects in the bucket, for as soon as possible (24 hours).

About 2 days later first thing in the morning I get a bunch of frantic messages from my manager that whatever script I was running, please stop it, before I'd even done anything for the day.

The lifecycle rule had taken effect near the end of the previous day, and he was just getting all the billing alerts from overnight, it was all done.

I read about glacier pricing, but didn't realize there was a lifecycle transfer fee per 1000 objects (I forget the exact price, maybe $0.05 per 1000 objects). That section was a lot further down the pricing page.

The bucket contained over 700 million small files.

I'd just blown $42,000.

That was over a month's AWS budget for us, in the end AWS gave us 10% back.

On the plus side, I didn't get in too much trouble, and given we'd break even in 4 years on S3 costs, upper management was gracious enough to see it as an unplanned investment.

TLDR: My company spent 42k for me to learn to read to the bottom of every AWS pricing page.
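The arithmetic behind that bill, as a rough check. The $0.05 per 1000 objects is the commenter's own half-remembered figure, not a verified price, and the 10,000-files-per-archive bundle size is purely an assumption for comparison.

```python
def transition_cost(objects: int, usd_per_1000: float) -> float:
    """Cost of a per-request lifecycle transition fee for `objects` objects."""
    return objects / 1000 * usd_per_1000


objects = 700_000_000      # "over 700 million small files"
remembered_rate = 0.05     # USD per 1000 objects, as recalled in the comment

# Roughly $35,000 at the remembered rate.
cost_at_remembered_rate = transition_cost(objects, remembered_rate)

# The reported $42,000 implies about $0.06 per 1000 objects at exactly
# 700 million files (or more files at the remembered rate).
implied_rate = 42_000 / (objects / 1000)

# Bundling files into archives first (assumed 10,000 files per archive)
# shrinks the object count, and the fee with it, by the same factor.
archives = objects // 10_000
bundled_cost = transition_cost(archives, remembered_rate)
```

The fee scales with object count rather than total size, which is why hundreds of millions of tiny files were so expensive to move and why bundling them first would have cut the fee to a few dollars under these assumptions.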

Nition 2269 days ago

What would have been the correct solution here? Group them into compressed archives first to reduce file count?

jenkinstrigger 2268 days ago

One .zip to rule them all :)

Nition 2268 days ago

Haha, I originally wrote "one giant zip file?" but I decided to rephrase it as a more serious answer.

stainforth 2269 days ago

Why would they create a pricing structure like that instead of ultimate total size?

quickthrower2 2269 days ago

Using a post paid service and getting bill shock.