by billisonline 1737 days ago
> we accidentally created an infinite event loop between two Lambdas. Racked up a several-hundred-thousand dollar bill in a couple of hours

May I ask how you dealt with this? Were you able to explain it to Amazon support and get some of these charges forgiven? Also, how would you recommend monitoring for this type of issue with Lambda?

Btw, this reminds me a lot of one of my own early career screw-ups, where I had a batch job uploading images that was set up with unlimited retries. It failed halfway through, and the unlimited retries caused it to upload the same three images 100,000 times each. We emailed Cloudinary, the image CDN we were using, and they graciously forgave the costs we had incurred for my mistake.
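
For what it's worth, a cap on retries is the kind of guard that would have limited the damage in a job like that. Here is a minimal sketch, assuming a hypothetical zero-argument `operation` callable that raises on failure; `retry_with_limit`, `max_attempts` and `base_delay` are illustrative names and defaults, not part of Cloudinary's or any other library's API.

```python
import time


def retry_with_limit(operation, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    `operation` is a hypothetical zero-argument callable that raises on failure.
    `sleep` is injectable so the backoff can be observed without waiting.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # give up instead of retrying forever
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Re-raising the last error after the final attempt means the job fails loudly rather than looping on the same items.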

3 comments

> May I ask how you dealt with this? Were you able to explain it to Amazon support and get some of these charges forgiven? Also, how would you recommend monitoring for this type of issue with Lambda?

AWS support caught it before we did, so they did something on their end to throttle the Lambda invocations. We asked for billing forgiveness from them; last I heard that negotiation was still ongoing over a year after it occurred.

Part of the problem was we had temporarily disabled our billing alarms at the time for some reason, which caused our team to miss this spike. We've since enabled alerts on both billing and Lambda invocation counts to see if either goes outside normal thresholds. It still doesn't hard-stop this from occurring again, but we at least get proactively notified about it before it gets as bad as it did. I don't think we've ever found a solution to cut off resource usage if something like this is detected.
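
As a rough illustration of an invocation-count alert like the one described above, the sketch below builds the arguments for CloudWatch's `put_metric_alarm` call. The helper name, alarm name, threshold and SNS topic ARN are all placeholders chosen for the example, not values from this thread, and the boto3 call itself is left as a comment.

```python
def invocation_alarm_params(function_name, threshold, sns_topic_arn, period_seconds=60):
    """Build keyword arguments for CloudWatch put_metric_alarm.

    The alarm fires when the summed Invocations metric for one function
    exceeds `threshold` within a single period.
    """
    return {
        "AlarmName": f"{function_name}-invocation-spike",  # placeholder naming scheme
        "Namespace": "AWS/Lambda",
        "MetricName": "Invocations",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no data means no invocations
        "AlarmActions": [sns_topic_arn],     # e.g. an SNS topic that notifies on-call
    }

# With boto3 installed and credentials configured, the alarm could be created with:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**invocation_alarm_params(...))
```

An alarm like this only notifies; it does not stop the function on its own.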

Earlier in the week there were threads about how AWS will never implement resource blocking like you're talking about, because big companies don't want to be shut off in the middle of a traffic spike, small companies don't pay enough money, and it's not like it hurts Amazon's bottom line.
We use memory safe languages, type safe languages. AWS is not fundamentally billing safe.

Just to give you nightmares. There's been DDoS in the news lately, I'm surprised nobody has yet leveraged those bot nets to bankrupt orgs they don't like who use cloud autoscaling services.

I don't know how you monitor it, part of the issue is the sheer complexity. How do you know what to monitor? The billing page is probably the place to start - but it is too slow for many of these events.

I guess you could start with the common problems. Keep watchdogs on the number of Lambdas being invoked, and on any resource you spin up or that autoscales. Egress bandwidth is definitely another I'd watch.
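
A watchdog of that sort can be very simple. The sketch below flags a sample as a spike when it exceeds some multiple of the recent median; `is_spike`, `factor` and `min_baseline` are made-up names and defaults for illustration, and how the counts are collected is left out entirely.

```python
from statistics import median


def is_spike(history, latest, factor=5.0, min_baseline=1.0):
    """Return True if `latest` exceeds `factor` times the median of `history`.

    `history` is a list of recent per-interval counts (invocations, bytes of
    egress, instance counts, ...). `min_baseline` avoids alerting on tiny
    absolute numbers when the baseline is near zero.
    """
    if not history:
        return False  # no baseline yet, so nothing to compare against
    baseline = max(median(history), min_baseline)
    return latest > factor * baseline
```

For example, `is_spike([100, 110, 90, 105], 5000)` flags the sample, while `is_spike([100, 110, 90, 105], 120)` does not. The median is used rather than the mean so that one earlier outlier doesn't inflate the baseline.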

Dunno, just seems to me you'd need to watch every metric and report any spikes to someone who can eyeball the system.

For me? I limit my exposure to AWS as much as I reasonably can. The possibilities, combined with the known nightmare scenarios and a "recourse" that isn't always effective, don't make for good sleep at night.

> There's been DDoS in the news lately, I'm surprised nobody has yet leveraged those bot nets to bankrupt orgs they don't like who use cloud autoscaling services.

AWS Shield Advanced actually offers DDoS cost protection to mitigate this specific risk: https://aws.amazon.com/shield/features/

> There's been DDoS in the news lately, I'm surprised nobody has yet leveraged those bot nets to bankrupt orgs they don't like who use cloud autoscaling services.

That’s interesting because it seems like it would happen, but what is in it for the attacker, when under threat they can implement caps?

A severe enough bill can cause an organization to be instantly bankrupt. No opportunity to try to do something like caps.

Regardless, turning on spending caps isn't a final solution to this particular attack. With caps, the site/resources will hit the cap and go offline, accomplishing what a DDoS generally tries to accomplish anyway.

The only real solution is that you have to have a cheap way to filter out the attacking requests.
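
One common shape for a cheap filter is a per-client token bucket. The sketch below is only an illustration of the idea: in practice this kind of filtering would sit at the edge (a CDN or WAF) rather than in application code, and the class name, parameters and client key are assumptions made for the example.

```python
import time


class TokenBucket:
    """Per-client token bucket: each request costs one token."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.clock = clock        # injectable for testing
        self.buckets = {}         # client key -> (tokens, last_seen)

    def allow(self, key):
        """Return True if the client identified by `key` may proceed."""
        now = self.clock()
        tokens, last = self.buckets.get(key, (self.capacity, now))
        # Refill in proportion to elapsed time, never above capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[key] = (tokens - 1, now)
            return True
        self.buckets[key] = (tokens, now)
        return False
```

Note that this keeps one entry per client forever, so a real deployment would also need to expire idle keys, and a botnet spread over many IPs would need a different key than the source address.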

It could only be an attack of spite; it can’t really be used for ransom, because the IPs of malicious traffic could be blocked or limits set after the initial overspend. Perhaps if the botnet was big enough.
Some people get paid to destroy competition, others just enjoy watching the world burn...
I think you're limited to 1,000 concurrent Lambda invocations by default anyway. That said, it's not easy to get an overview of what's going on in an AWS account (except through Billing, but I don't know how up to the moment that is).
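
On top of the account-wide limit, a per-function cap can be set through Lambda's reserved concurrency setting, and a value of 0 throttles the function entirely. A minimal sketch, assuming a boto3 Lambda client is passed in; `cap_concurrency` and the function name in the usage comment are hypothetical.

```python
def cap_concurrency(lambda_client, function_name, limit):
    """Set reserved concurrency for one function.

    `lambda_client` is expected to be a boto3 Lambda client (or anything with
    a compatible put_function_concurrency method). A limit of 0 throttles
    every invocation, which makes it usable as a manual kill switch.
    """
    if limit < 0:
        raise ValueError("limit must be non-negative")
    return lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=limit,
    )

# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   cap_concurrency(boto3.client("lambda"), "my-function", 10)
```

This limits how fast a runaway loop can run rather than detecting it, so it complements alerts instead of replacing them.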
I've been able to get AWS support to waive fees for a runaway Lambda that no one spotted for a few weeks - they wanted an explanation of what happened and a mitigation strategy from us and that was it. It is still unresolved because AWS wants us to pay the bill so they can then issue a credit but the company credit card doesn't have a high enough limit to cover the bill.