Hacker News
by nrmitchi 1685 days ago
As the author touches on, the main problem here isn't learning about indexes. It's about "infinity scaling" working too well for people who do not understand the consequences.

In no sane version of the world should "not adding a db index" lead to getting a 50x bill at the end of the month without knowing.

I am a strong believer that services that are based on "scale infinitely" really need hard budget controls, and slower-scaling (unless explicitly overridden/allowed, of course).

If I accidentally push very non-performant code, I kind of expect my service to get less performant, quickly realize the problem, and fix it. I don't expect a service to seemingly-magically detect my poor code, increase my bill by a couple orders-of-magnitude, and only alert me hours (if not days) later.


There's no free lunch. Cloud services trade performance woes for budget surprises. This may be preferable in some cases but the tradeoff should be recognised.
There's plenty of space in the middle though, no? Bank accounts cut you off if you hit a zero balance, or they can execute your transactions and charge you a fee. Why can't these services implement throttling or even halting if the charges hit a certain ceiling?
In some cases the query might have finished before the data hits the billing system.
That’s not an argument.

As long as query #2 doesn’t run, that’s better than nothing.

So you want to add an overhead for each query to check an internal LRU cache that then checks the billing system? Just the overhead of hashing the query into some cacheable identifier will hurt performance.
Disagree; cloud services trade reduced operation work for higher prices. There is nothing inherent to "cloud services" that requires budget surprises.
> Cloud services trade performance woes for budget surprises.

I'm not sure why you think this is a trade-off. In general cloud services automate operations. Whether they are faster is unrelated. Many are not: services that use object storage for backing storage can be orders of magnitude slower than equivalent software using NVMe SSDs.

Our internal monitoring alerts on performance anomalies. It's quite possible to scale and warn you at the same time.
> This may be preferable in some cases but the tradeoff should be recognised.

It's not a "tradeoff", it's a product feature.

>I don't expect a service to seemingly-magically detect my poor code, increase my bill by a couple orders-of-magnitude

When you put it like that, it sounds like an awfully good business to be in.

Haha yep, I was like wait, I'm used to getting feedback from the system telling me I messed up, and this I barely noticed. PlanetScale has Query Statistics that are really useful for spotting slow queries but don't expose the "rows read", so you can't really tie this view back to billing. I think they're aware of this though and might expose that information.
It can't be like that. I have discussions with vendors sometimes and the first question I ask - if something lapses and we weren't paying attention - you won't cut our service right?

I think too, in most cases, people would rather run over than cut service.

Also how would such a system work? Let's say you sign up for some API and what, set your billing limit to 500 requests per day. Let's say you're now hitting fabulous numbers / signups - but suddenly you start hitting that 500. If that shuts off your signups or what have you, you're typically going to be worse off than if you just pay the overage bill.

I know it sucks, but the first time you pay your overage is probably your last.

It's important to think about this in an a la carte design, not one fixed solution for all use cases.

Step 1: You give people the ability to put in soft limits - "Warn me when I hit 500".

Step 2: You also give the ability to put in hard limits - "Pull the plug at 10k". (Caveat to both of these: you guarantee this at an eventual consistency level, like "Well, you hit 500, but by the time our stats updated you were at 600".)

Step 3: You introduce rate limits - "We're expecting 500 in a month, warn us if we hit 50 in a day or 10 in an hour".

Step 4: You introduce predictive warnings "Our statistics show you'll hit your monthly limit on the 23rd of the month"

Step 5: You put in predictive limits to allow scaling - "The last 3 months we've seen the following use trend, warn us if we exceed double that trend, cut off if we see 50x that trend"

You might set some of these limits or none of these limits depending how predictable your use case is.
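
To make the steps above a bit more concrete, here is a rough Python sketch of how steps 1 through 4 could be evaluated against usage figures. Everything in it (the `Limits` fields, the `evaluate` function, the action strings) is made up for illustration and is not any provider's actual API. The projection is a naive linear extrapolation, and the trend-based limits from step 5 are left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Limits:
    # All fields are hypothetical and optional; None means "not configured".
    soft_limit: Optional[int] = None        # step 1: warn at this monthly usage
    hard_limit: Optional[int] = None        # step 2: cut off at this monthly usage
    daily_rate_limit: Optional[int] = None  # step 3: warn on a daily burst
    monthly_budget: Optional[int] = None    # step 4: warn on projected overrun


def evaluate(limits: Limits, used_this_month: int, used_today: int,
             day_of_month: int, days_in_month: int) -> List[str]:
    """Return the actions triggered by the (eventually consistent) usage figures."""
    actions = []
    if limits.soft_limit is not None and used_this_month >= limits.soft_limit:
        actions.append("warn: soft limit reached")
    if limits.hard_limit is not None and used_this_month >= limits.hard_limit:
        actions.append("cutoff: hard limit reached")
    if limits.daily_rate_limit is not None and used_today >= limits.daily_rate_limit:
        actions.append("warn: daily rate exceeded")
    if limits.monthly_budget is not None and day_of_month > 0:
        # Naive linear projection of month-end usage from usage so far.
        projected = used_this_month * days_in_month / day_of_month
        if projected > limits.monthly_budget:
            actions.append("warn: projected to exceed monthly budget")
    return actions
```

With nothing configured, `evaluate(Limits(), ...)` returns an empty list, which matches the "some of these limits or none" idea.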

> Let's say you're now hitting fabulous numbers / signups - but suddenly you start hitting that 500.

Sure, you get alerted, confirm it's reasonable, and then change your limits. You're also describing how many APIs actually work.

I'll say there is also a difference between going from 500 requests/day to 1000 requests/day, where you might say "this is probably legitimate and I want to run over", and from 500 requests/day to 25k requests/day.

One is mildly inconvenient, and the other is potentially bankrupting.

If you’re expecting something near 500, then why would you set your limit to 500? Set it towards something like 20k at least.

Or obviously if you don’t think this will be a problem, you have control to set it to uncapped.

I don’t understand what argument you have.

A billing limit feature is something that's been wanted for years, yet the most that's offered is budget alerts.
The question that follows would be: how do you know what was intended to be less performant versus optimized on-demand? The intentions can be easily inferred when the query at hand was a simple join, and to no surprise, many cloud database offerings _do_ provide optimization automation (Azure SQL, for example, will even automatically add obvious indexes if you let it). But what if the query did need to scan all the rows in a join, but was only a one-off, and you didn’t want to pay the continued perf and storage costs of maintaining an index? The cloud provider can’t know that, and even with proactive measures (“make it slower” can’t work because speed is part of the product design, and budget controls can only go so far before they impact your own customers) there’s only so much that can be done. The choice of infinity scale tools comes with infinity scale costs, and so there’s a responsibility that engineers using these tools need to understand what they’re accepting with that choice.
> The question that follows would be: how do you know what was intended to be less performant versus optimized on-demand?

I'm saying that the cloud provider shouldn't try to make assumptions either way, and I'm definitely not saying that it should try to manage indexes for you.

If you are typically using X ops/s, and begin using 50X ops/s, the default should not be "this customer probably wants to spend 50x their previous spend". It should maybe scale up some percentage of previous usage, but definitely not into a range that would be considered anomalous.
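
As a rough sketch of that default, assuming the provider tracks some recent baseline per customer: the function name `allowed_capacity` and the `max_multiple` knob below are hypothetical, and the 3x default is an arbitrary illustration, not a recommendation.

```python
def allowed_capacity(requested: float, baseline: float,
                     max_multiple: float = 3.0) -> float:
    """Cap a scale-up request at max_multiple times the recent baseline.

    baseline is the customer's typical usage (e.g. ops/s averaged over
    recent weeks); max_multiple is a hypothetical knob that the customer
    could raise explicitly if they really do want 50x.
    """
    if baseline <= 0:
        # No history yet: nothing to compare against, so allow the request.
        return requested
    return min(requested, baseline * max_multiple)
```

Going from 100 to 150 ops/s passes straight through, while a jump from 100 to 5000 ops/s is held at 300 until someone raises the multiple.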

> The choice of infinity scale tools comes with infinity scale costs, and so there’s a responsibility that engineers using these tools need to understand what they’re accepting with that choice.

Sure, but I have never once seen one of these providers make clear that using them comes with the risk of being charged "infinity money".

Honestly, just a limit isn't bad. Just an option to "Stop all operations if bill exceeds $300" would make this a LOT safer for most folks.
Or perhaps a “do not allocate more than $1/min” or something similar - which makes cloud servers mimic bare metal hardware - when you overload it slows down but keeps trying.
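
One way to picture the "$1/min" idea is a token bucket where the tokens are dollars. This is only a sketch under that assumption: `SpendBucket`, `try_spend`, and the per-operation `cost` are invented names, and a real provider would also need some way to price an operation before running it.

```python
class SpendBucket:
    """Token bucket where tokens are dollars, refilled at rate_per_min."""

    def __init__(self, rate_per_min: float, burst: float):
        self.rate_per_sec = rate_per_min / 60.0
        self.burst = burst   # maximum dollars that can accumulate
        self.tokens = burst  # start with a full bucket
        self.last = 0.0      # timestamp of the last refill, in seconds

    def try_spend(self, cost: float, now: float) -> bool:
        """Charge cost and return True if it fits the budget, else return False."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_per_sec)
        self.last = now
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False  # caller should queue or delay the work, not drop it
```

When `try_spend` returns False the work waits for the bucket to refill, which is the "slows down but keeps trying" behaviour of an overloaded bare metal box.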
> In no sane version of the world should "not adding a db index" lead to getting a 50x bill at the end of the month without knowing.

Computers do what you tell them to do. If you are totally clueless and don't bother to take even a few minutes to try to understand a system you are using, the results are going to be poor. Thinking any system can overcome total user ignorance is the thing here that isn't sane.

What the person in this article did is like opening all your windows and setting the thermostat to 74 degrees. It will use massive amounts of energy and just keep trying to heat the house 24/7. If someone turns around after doing this and claims there is actually a problem with thermostats not being smart enough because what if someone doesn't know leaving the window open lets cold air in, well, they shouldn't be allowed to touch the thermostat anymore.

> Computers do what you tell them to do. If you are totally clueless and don't bother to take even a few minutes to try to understand a system you are using, the results are going to be poor. Thinking any system can overcome total user ignorance is the thing here that isn't sane.

In theory I agree, but this website features something like "how I nearly bankrupted myself with an AWS bill" on the homepage every month or so. People are blissfully unaware about the extreme costs they're paying to the scaling cloud providers that they often don't even need in the first place.

While I don't think services should block extreme spend altogether, a monthly/weekly/daily limit would go a long way to prevent these stories. Very few services that abstract away performance costs have a good way to limit expenses. I don't know if that's intentional or if these companies just don't care, but it's infuriating to me.

It's fine to expose the same tool to both someone who doesn't know the difference between indexes and foreign keys and someone who's been building cloud infra for many years, but as a company you should be prepared to respond to your customers' most likely mistakes. This specific case would probably be hard to detect automatically, but so many wasted CPU cycles, kilowatts and forgiven bills could be prevented if someone would just send an email saying "hey, you've been using more than 10x the normal capacity today, everything alright?"
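
The check behind such an email does not need to be clever. A minimal sketch, assuming the provider keeps a per-customer list of recent daily usage figures (the `usage_alert` name and the choice of the median as "normal" are my own assumptions):

```python
from statistics import median
from typing import Optional, Sequence


def usage_alert(history: Sequence[float], today: float,
                factor: float = 10.0) -> Optional[str]:
    """Return an alert message if today's usage exceeds factor x a typical day."""
    if not history:
        return None  # no baseline yet
    typical = median(history)  # median resists being skewed by past spikes
    if typical > 0 and today > factor * typical:
        return (f"Hey, you've used {today / typical:.0f}x your normal "
                f"capacity today, everything alright?")
    return None
```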

This is a lot of victim-blaming in such a small response.

> If you are totally clueless and don't bother to take even a few minutes to try to understand a system you are using, the results are going to be poor.

Having a hosted system which behaves differently than the underlying technology it's modelled on is not immediately clear. The realm of "things you don't know that you don't know" expands drastically with managed services.

> Thinking any system can overcome total user ignorance is the thing here that isn't sane.

It's never been suggested that this is possible. There is a large range of options in between "solve all user error" and "don't hand everyone a loaded foot-gun".

> Having a hosted system which behaves differently than the underlying technology it's modelled on is not immediately clear. The realm of "things you don't know that you don't know" expands drastically with managed services.

So don't use managed services? They are expensive and the only thing that works consistently and well is the lock-in, everything else is pretty iffy. Somehow people look at me like an idiot when I say this, but it's LESS effort to NOT use AWS and build everything yourself. I guess this seems impossible somehow, but at the scale you are ever going to operate it's not hard to just build a service to store and serve files (S3), and if you scale to the point where you can't build it easily, you will build it anyway because you can afford to hire enough engineers to build it and still save huge amounts of money. The same goes for every managed service offered on the cloud, they are not a good deal at any point, ever, for anybody.

> It's never been suggested that this is possible.

The gist of the article is they got a refund because they didn't bother to pay close enough attention to realize their queries were doing full table scans, and they didn't bother to pay close enough attention to realize this was causing the service to scale in capacity to an absurd degree.

Why not write a simple service that tracks various stats (like number of users, requests, etc.) as well as billed costs over time?

You could then get various interesting stats in real time as well as some pretty useful alerting.

Even with "infinite scale", you should still be monitoring, and be doing some form of budget monitoring.
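
A minimal sketch of what that could look like, assuming you can get a daily billed amount out of your provider at all: `Sample`, `cost_per_1k_requests` and `flag_cost_jumps` are made-up names, and the 5x ratio is an arbitrary threshold.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    day: str           # e.g. "2021-11-01"
    requests: int      # application-side counter
    billed_usd: float  # whatever the provider's billing export reports


def cost_per_1k_requests(s: Sample) -> float:
    """Unit cost for the day; infinite if there were no requests."""
    return 1000.0 * s.billed_usd / s.requests if s.requests else float("inf")


def flag_cost_jumps(samples: List[Sample], ratio: float = 5.0) -> List[str]:
    """Return days whose unit cost is more than ratio x the previous day's."""
    flagged = []
    for prev, cur in zip(samples, samples[1:]):
        before, after = cost_per_1k_requests(prev), cost_per_1k_requests(cur)
        if before > 0 and after > ratio * before:
            flagged.append(cur.day)
    return flagged
```

Tracking cost per request rather than raw cost means a genuine traffic spike does not trip the alert, but a query that suddenly costs 50x more per call does.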

The difference is that application performance metrics are generally available in near-real time, whereas billing metrics are 1) very platform specific, and 2) generally not even close to real time.

It's hard to react quickly when your platform has effectively transformed near-real-time performance alerts to delayed/rolled-up billing alerts (which would also be much more difficult to use to pinpoint where the underlying issue is)

If you create an inefficient process, you should be responsible for the consequences. Why would you expect some third party to take the responsibility?

If you create a horrible internal combustion engine, your gas station should not bear the costs.

If you create an inefficient internal combustion engine, you'd know because you have to go to the gas station every 5 miles. In this case it would be like someone was filling up the gas without you knowing, and then a few weeks later you get the bill, and then you realize that your engine is inefficient.
In theory yes; in practice it’s very easy to push inefficient code to production by accident, as shown in the article.
> I am a strong believer that services that are based on "scale infinitely" really need hard budget controls, and slower-scaling (unless explicitly overridden/allowed, of course).

+1 on the budget control, but I don't think there are good arguments in favor of slower scaling.

The ability to scale on demand is sold (and bought) based on the expectation that services just meet the workload that's thrown at them without any impact on availability or performance. That's one of the main selling points of managed services, if not the primary selling point.

Arguing in favor of slower scaling implies arguing in favor of downtime. A service that's too slow to scale is a service that requires a human managing it. A managed service that is unable to meet demand fluctuations is a managed service that can't justify the premium that is charged for it.

I may not have been as clear as I should have; I'm not necessarily arguing that typical or expected scaling actions should be slowed down. I.e., throttling scaling from X -> 1.5X doesn't really make sense.

A scaling change that would be considered anomalous, and introduces an order-of-magnitude change over historical usage could be scaled more slowly.
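
Roughly what I have in mind, as a sketch only: `next_capacity` and its thresholds are hypothetical. Growth inside a "normal" band around the historical baseline is met immediately, and anything beyond it ramps by a bounded step per scaling interval.

```python
def next_capacity(current: float, demand: float, baseline: float,
                  normal_multiple: float = 2.0,
                  slow_step: float = 1.25) -> float:
    """Pick the capacity for the next scaling interval.

    Demand within normal_multiple x baseline is met at once. Beyond that,
    capacity grows by at most slow_step per interval, leaving time for a
    human to notice. All thresholds are hypothetical defaults.
    """
    if demand <= current:
        return demand                        # scaling down is never throttled
    fast_ceiling = baseline * normal_multiple
    if demand <= fast_ceiling:
        return demand                        # expected growth: scale at once
    if current < fast_ceiling:
        return fast_ceiling                  # jump to the edge of "normal"
    return min(demand, current * slow_step)  # then creep upward
```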

> Arguing in favor of slower scaling implies arguing in favor of downtime.

Sure, I guess that in a limited scope, that is what I am saying. I would much rather have a short-term "downtime that requires human intervention" problem than a long-term "Johnny deployed bad code and now the company is bankrupt" problem.

> The ability to scale on demand is sold (and bought) based on the expectation that services just meet the workload that's thrown at them without any impact on availability or performance. That's one of the main selling points of managed services, if not the primary selling point.

I tend to disagree with this. Managed services are often bought on the expectation that they do not require management or deep operational knowledge, and that they are reliable. There's also often the trade-off of upfront costs (either human or capex costs).

Scalability is obviously part of the analysis, but "scalability" and "the ability to scale from 1X -> 100X in a couple seconds" are not necessarily the same thing.

> In no sane version of the world should "not adding a db index" lead to getting a 50x bill at the end of the month without knowing.

Oh, that would actually be quite useful for learning things if the bill would tell you that it got so high because you, stupid dumb-ass, didn't use DB indices properly.

I'm shocked every time by how many people using DBs don't know about indices! Those people should pay such a bill once. They would never ever again "forget" about DB indices, I guess.

Of course I'm joking to some extent. But only to some extent…