by eridius 3235 days ago
It is a serverless issue because if you were using your own server, a mistake like this wouldn't have cost money, it would have just degraded your service (or possibly brought it offline).

So I guess the question is, with a mistake like this, is it better to be charged hundreds or thousands of dollars, or to have your service degrade or go offline until you can fix it?


Could you just do serverless where it starts to rate limit once you reach a certain cost? It seems like this is an issue that could be fixed somehow.
Yes, but that isn't always the right answer. If your system is starting to cost "a lot", is that because of a bug (this case), or is it because your idea just "went viral" and you are now getting tons of paying customers signing up?

If it is the latter you do not want any rate limiting, you want everything to scale as fast as possible (I hope there are no bugs on your end). Rate limiting means that your new customers get a poor experience and so they are more likely to ask for a refund, or not renew next time.

It's almost as if... they should offer multiple options so customers could choose based on their business/hobby needs:

1. Warn me at $X but don't throttle me for any reason--I'll pay if I go viral

2. Warn me at $X and start throttling until I get to $Y at which point stop service and stop charging

3. Warn me at $X and stop service/charging immediately
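To make the three options concrete, here is a minimal sketch of how such a setting might be modeled. Everything in it (`OveragePolicy`, `decide`, the dollar amounts) is hypothetical and only illustrates the list above; it is not an API that AWS or any other provider is known to offer.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    SERVE = "serve"
    THROTTLE = "throttle"
    STOP = "stop"


@dataclass(frozen=True)
class OveragePolicy:
    """Hypothetical per-account setting; no provider API is implied."""
    warn_at: float                    # $X: send a warning at this spend
    throttle: bool = False            # throttle once spend passes warn_at?
    stop_at: Optional[float] = None   # $Y: stop service and charging here


def decide(policy: OveragePolicy, spend: float) -> tuple[Action, bool]:
    """Return (action, warn) for the current month-to-date spend."""
    warn = spend >= policy.warn_at
    if policy.stop_at is not None and spend >= policy.stop_at:
        return Action.STOP, warn
    if warn and policy.throttle:
        return Action.THROTTLE, warn
    return Action.SERVE, warn


# The three options, with X = 50 and Y = 200 as made-up example values.
WARN_ONLY = OveragePolicy(warn_at=50)
THROTTLE_THEN_STOP = OveragePolicy(warn_at=50, throttle=True, stop_at=200)
STOP_IMMEDIATELY = OveragePolicy(warn_at=50, stop_at=50)
```

Option 3 is just option 2 with $Y set equal to $X, which is why one policy type can cover all three.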

The market has done this, in part, through segmentation.

When you are on shared hosting, the expectation is that you get shut off when you go over.

When you are on "unlimited" shared hosting, the expectation is that you and everyone on the server gets throttled when you go over.

When you are on a VPS, the expectation is that you will be throttled when you go over, and you will be throttled much less than with other options when your neighbor goes over.

With cloud, then, the expectation is that if you go over, you are charged more proportionately, but things continue to work.

Of course, this is a simplification, but I think it is accurate enough to be useful.

I do agree that it would be better to choose your api/provisioning and node reliability separately from overage behavior, but most of these behaviors and expectations were based on traditions that were shaped by technical constraints.

To credibly say "we will keep you online and just charge you" you need a lot of spare capacity.

Throttling one customer on a shared host without impacting other customers used to be very difficult. It is way easier to throttle one VPS customer, and easier still to throttle that one customer when they have their own kernel and reserved memory. It is not as big of a deal as it once was, considering everyone now uses SSDs, but systems that share page cache are notoriously difficult to set up such that heavy users don't impact light users.

AWS has so many services that trying to decide what to stop if you reach a billing threshold would be impossible to automate. Similarly, pricing is not built into the individual services' APIs, so adding a per-item billing threshold would not be a trivial task.

> based on their business/hobby needs

AWS is not interested in hobbyists - other vendors are picking up the crumbs there.

You can optimize step 3 away into step 2.
This is splitting hairs though. It's a mistake in the code that caused it to do something unexpected that costs money. In the serverless world, that means invoking a function repeatedly, costing money. In the old-server world, maybe it means your script had a bug that downloaded an image repeatedly, causing you to rack up networking charges.
It is a mistake, yes. But this particular mistake would have behaved very differently on a normal server. Just because there exist mistakes you can make that would have the same consequences on a regular server vs. serverless doesn't mean you can just shrug your shoulders and say all mistakes are the same.

The fundamental issue here is serverless is great at allowing you to automatically scale to meet demand, but it also is great at automatically scaling to meet unexpected resource usage caused by errors (or poor design). And so this means a mistake on your end can cost you a lot of money, because the system thought that it was real demand.

Isn't there also a third danger with anything that scales your bill as your app scales - the possibility of some black hat ddos-ing you for the hell of it?
Yes, but I guess in that case you would put your lambda function behind an API gateway, and limit the user requests. If it's static content, you would serve it from a CDN. Not a specialist on this, but that's what I would do.
Wouldn't an API gateway typically limit requests per IP/end user?

I guess it could limit global request rate. But the idea of unbounded elastic services behind a global rate limiter is just funny to me. Like a Ferrari with a 50mph limiter.
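For reference, a global rate limiter of the kind mentioned here is often built as a token bucket. The sketch below is only an illustration of that idea; the `TokenBucket` class and its parameters are assumptions, not how any particular API gateway implements throttling.

```python
import time


class TokenBucket:
    """Global limiter: about `rate` requests/second, bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum tokens the bucket can hold
        self.clock = clock          # injectable so the limiter can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        """Return True if a request may proceed, consuming one token."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Because the bucket is shared by all callers, it bounds the total invocation rate (and so the bill) regardless of how many IPs the traffic comes from. That is also exactly the "50mph limiter" trade-off: legitimate bursts above `capacity` get rejected too.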

Yes, I still don't get it.
> It is a serverless issue because if you were using your own server, a mistake like this wouldn't have cost money.

We dynamically create and instantiate new servers based on load, if it's sustained for a while. Once a server is up, it's added to the load balancer. Once the load on the servers goes down, a server is spun down after it has spent some time idle (it costs to instantiate, so we might as well keep it outside of the queue for a bit before completely removing it).

This all runs automatically. If we don't limit it, it's on us.
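A rough sketch of that kind of setup, with the limit made explicit. The `Autoscaler` class, its thresholds, and its tick-based loop are all assumptions made for illustration; the actual system described above is not shown in this thread.

```python
from dataclasses import dataclass


@dataclass
class Autoscaler:
    max_servers: int            # hard cap: the limit that keeps the bill bounded
    min_servers: int = 1
    high_load: float = 0.8      # average utilisation that counts as "loaded"
    low_load: float = 0.3       # average utilisation that counts as "idle"
    sustain_ticks: int = 3      # how long load must persist before scaling up
    idle_ticks: int = 5         # how long to keep an idle server before removal
    servers: int = 1
    _hot: int = 0               # consecutive high-load samples seen
    _cold: int = 0              # consecutive low-load samples seen

    def tick(self, avg_load: float) -> int:
        """Feed one load sample; return the server count after this tick."""
        self._hot = self._hot + 1 if avg_load >= self.high_load else 0
        self._cold = self._cold + 1 if avg_load <= self.low_load else 0
        if self._hot >= self.sustain_ticks and self.servers < self.max_servers:
            self.servers += 1   # sustained load: add one server
            self._hot = 0
        elif self._cold >= self.idle_ticks and self.servers > self.min_servers:
            self.servers -= 1   # idle long enough: remove one server
            self._cold = 0
        return self.servers
```

With `max_servers` set, a runaway loop saturates the fleet and degrades service instead of growing the bill without bound. Leaving the cap out is the "it's on us" case.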

How is this not a problem with how he managed it?

> This is probably the most stupid thing I ever did. One missing return; ended up costing me $206.

He clearly mentioned it's his error there.

There are chances that degradation or unavailability are not free as in beer.

If the degraded or offline system is used by people, and these people cannot work, the cost can be a lot higher. For example, 10 people not able to work could cost something in the range of $250-$750 per hour.

Moreover, if customers are lost due to this degradation of service and CAC is high, then clearly the cheapest thing is a high bill by AWS, which probably is also capped by Amazon (and handled as an alert by Amazon).
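As a quick back-of-the-envelope check using only figures quoted in this thread (the $206 bill and the $250-$750 per hour estimate), this is roughly how long an outage would have to last before it costs as much as the bill. The function name is made up for the example.

```python
def break_even_minutes(bill: float, outage_cost_per_hour: float) -> float:
    """Minutes of outage whose cost equals the given bill."""
    return bill / outage_cost_per_hour * 60


bill = 206.0  # the AWS bill quoted in the thread
for rate in (250.0, 750.0):  # estimated cost of 10 people unable to work
    minutes = break_even_minutes(bill, rate)
    print(f"${rate:.0f}/hour outage matches the bill after {minutes:.0f} minutes")
```

Under those assumptions, an outage of somewhere between about a quarter of an hour and about fifty minutes already costs as much as the bill did.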

Oh sure. That's why I posed it as a question. Service degrading or going offline could be disastrous and cause losses of thousands of dollars or more, depending on what the service is. But there's also plenty of services where it's cheaper to have the service go down than it is to get an outsized AWS bill. This is just something you need to be aware of when deciding if serverless is the way to go.
The developer should be writing unit tests for their code so they can avoid small mistakes like this.
It is impractical to cover every line of code with tests (it would get too expensive). Furthermore, in this case the author would have to test production config interacting with Amazon servers rather than a piece of code.

And even 100% code coverage doesn't find all possible errors.

Do you need to cover every _line_ of code, or do you need to test resulting behavior? Also, while nothing is 100% foolproof, the example here would probably have been caught.
I doubt this; a unit test wouldn't have covered the infinite triggering of the created events.
> because of a refactor, I forgot the return statement and it just continued overwriting the file again

Unit tests are specifically useful for refactors. You can refactor your code and ensure that it behaves as intended. Integration tests are great, too, don't get me wrong. Either or both would have probably caught this.

This was a problem involving multiple parts; unit tests normally don't catch this. You need an integration/functional test, and that can be much more time consuming to write for all "integrations" and code paths.
It was a single function that changed behavior after a refactoring. It did work where it did not need to, because the work was already done on the object. This is only hard if you don't test at all and can't already mock the object download/upload or don't have pure functions.
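A minimal sketch of what such a test could look like. The original code is not shown in this thread, so the handler, the `processed:` marker, and the `FakeStore` stand-in for object storage are all hypothetical; the point is only that a mocked store makes "no write when the work is already done" checkable.

```python
class FakeStore:
    """In-memory stand-in for object storage; counts writes."""

    def __init__(self, objects):
        self.objects = dict(objects)
        self.writes = 0

    def get(self, key):
        return self.objects[key]

    def put(self, key, value):
        self.objects[key] = value
        self.writes += 1


def process(data: str) -> str:
    """Pure transformation (hypothetical): mark the payload as processed."""
    return data if data.startswith("processed:") else "processed:" + data


def handle_created(store, key):
    """Handler for an 'object created' event."""
    data = store.get(key)
    if data.startswith("processed:"):
        return  # early return; without it every write would re-trigger the handler
    store.put(key, process(data))


def test_already_processed_object_is_not_rewritten():
    store = FakeStore({"img": "processed:abc"})
    handle_created(store, "img")
    assert store.writes == 0


def test_new_object_is_written_once():
    store = FakeStore({"img": "abc"})
    handle_created(store, "img")
    assert store.writes == 1 and store.get("img") == "processed:abc"
```

If the `return` is dropped in a refactor, the first test fails because the handler writes the object again. It does not exercise the real event trigger, which is where an integration test would come in.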
In this case it might have helped (I didn't read the code), but in the more general case these kinds of things are rarely found by unit tests. I still doubt that the triggering caused by the file change would have been found in a unit test.
Down vote for advocating for unit tests? That's just good practice in general.
I think integration tests would be more appropriate here, especially since there are different co-operating moving parts: S3 <--> EC2/Lambda.
I notice AWS doesn't have any ability to set limits....