Hacker News new | ask | show | jobs
by Grimm1 1681 days ago
I feel like indexes are a pretty fundamental type of DB knowledge. In fact I'd say it's table stakes knowledge you should have if you're working with them. Further more, knowing that ForeignKeys typically apply an index to that column is also in my head basic knowledge. I'm sorry you got burnt, and congrats on learning a lesson, but you could have gotten the same knowledge by ever googling MySql ForeignKeys and saved yourself a headache.

In fact it's like a big bullet point near the top of the docs page.

"MySQL requires indexes on foreign keys and referenced keys so that foreign key checks can be fast and not require a table scan. In the referencing table, there must be an index where the foreign key columns are listed as the first columns in the same order. Such an index is created on the referencing table automatically if it does not exist. This index might be silently dropped later if you create another index that can be used to enforce the foreign key constraint. index_name, if given, is used as described previously."

I'm not entirely sure why buzz around "developer learns basic knowledge" has this on the front page.

9 comments

Good for you. But I think you're being uncharitable by failing to distinguish between "concept I didn't understand" and "thing I forgot to consider until I saw the problem it caused". The title also suggests the former, but I think the author is being a bit humble by underplaying his existing knowledge. Likely he actually did know what indexes are before; if you asked him to detail how MySQL foreign keys work he might have even remembered to say they add an implicit index. But it's super easy to miss that you're depending on a side effect like that until you see the slow query (or, in this case, high bill).

When you're programming, how many compiler errors do you see a day? (For me, easily dozens, likely hundreds.) Do you think each one indicates a serious gap in your knowledge?

Along these lines: imposter syndrome is a common problem in our industry. One way it can manifest is junior engineers can thinking they're bad programmers when they repeatedly see walls of compiler errors. I think it'd help a lot to show them a ballpark of how often senior engineers see the same thing. [1] I know that when I'm actively writing new code (especially in languages that deliberately produce errors at compile time rather than runtime), I see dozens and dozens of errors during a normal day. I don't think this is a sign I'm a bad programmer. I think it just means I'm moving fast and trusting the compiler to point out the problems it can find rather than wasting time and headspace on finding them myself. I pay more attention to potential errors that I know won't get caught automatically and particularly to ones that can have serious consequences.

I think the most important thing the author learned is that failing to add an index can cost this much money before you notice.

Ideally the author and/or the vendor will also brainstorm ways to make these errors obvious before the high bill. Load testing with realistic data is one way (though people talk about load testing a lot more than they actually do it). Another would be watching for abrupt changes in the operations the billing is based on.

[1] This is something I wish I'd done while at Google. They have the raw data for this with their cloud-based work trees (with FUSE) and cloud-based builds. I think the hardest part would be to classify when someone is actively developing new code, but it seems doable.

No you've missed my point, the author seemingly didn't know that ForeignKeys applied indexes by default in MySql. It's not "Concept I didn't understand", clearly they're capable of understanding because they did after they ran into the issue. It's about not having had basic knowledge to begin with.

But he didn't see compiler errors, he caused monetary cost to his employer.

When I deploy something that unintentionally causes a large monetary bill to my employer, then yes I do believe that indicates a gap in knowledge so I don't in anyway believe I'm being uncharitable. Or and this would be worse, a lack of caring. (Which is not what I think happened here though)

I won't respond to your imposter syndrome bit I don't really think it's relevant to my point.

> When I deploy something that unintentionally causes a large monetary bill to my employer, then yes I do believe that indicates a gap in knowledge so I don't in anyway believe I'm being uncharitable.

It depends, if you've been given a loaded footgun it's not entirely your fault when it inevitably goes off.

Let's go back to your "compiler errors" scenario, and let's say someone decided that the company should be using a cloud-based compiler that happens to charge per error. I wouldn't blame developers for falling into a trap that challenges all known assumptions.

The problem is that there is a DB that charges insane amounts of money per row processed with no upper limit and that someone actually thought it was a good idea to use it.

>The problem is that there is a DB that charges insane amounts of money per row processed with no upper limit and that someone actually thought it was a good idea to use it.

That's it in a nutshell. Usually you have an upper bound on compute, memory, disk space or some other resource for a specific price. When you hit those limits, you find performance issues and at that point you can choose to try optimizing your code or database, then decide whether you need to upgrade resources at cost.

I really don't understand this model that charges for rows read or, worse, "inspected". What's the upside of that model versus more typical pricing schemes, and how is it manageable/predictable from a budget perspective? With or without the indexing problem here, you'd really have to know your user behavior, then translate that to DB read counts by your app. And, while devs should all be optimizing code as much as reasonable, something as specific as minimizing DB reads seems an odd constraint to place on software.

I'm guessing there must be some use case I'm missing; else I don't know why this pricing scheme is even a thing.

I am somewhat shocked to find that an RDBMS is considered a “loaded footgun” in 2021. Perhaps grandparent isn’t the most charitable in their interpretation, but I am in full agreement. It continues to astound me how little about the basics of databases most developers know, and how strongly resistant they are to trying to learn.
An RDBMS that scales infinitely while charging you per-row goes against the usual assumptions learned in the past decades, so I'd say yes that's a loaded footgun.
Have a friend who had a BigQuery query that ended up costing $3k each time. It ran for only a minute, because BigQuery chews through data really fast. But you don't realize that when you push the run query button. And there's no spend guard rails. They switched to pay for given amount of parallelism after that.
First its not my "compiler errors" scenario it's the person who initially replied to me. Sure whatever, I don't think I ever insinuated I thought that was a good idea, it runs in parallel with the issue I have.
I think you didn't read through to this part of my comment:

> I think the most important thing the author learned is that failing to add an index can cost this much money before you notice.

> Ideally the author and/or the vendor will also brainstorm ways to make these errors obvious before the high bill. Load testing with realistic data is one way (though people talk about load testing a lot more than they actually do it). Another would be watching for abrupt changes in the operations the billing is based on.

No I did, but since I disagree with your earlier point about how much existing knowledge they have it kind of by default means I disagree with what they took away from this incident.

It's also highly speculative so like I'm not going to go back and forth on it.

Needing a vendor to hand hold your likely highly paid dev seems like a bad fix to me.

Also not having an index isn't an error it can be a valid choice based on your situation and query load which is why people should know the situations when they're needed.

I think people should simply be better. A lot of people don't like hearing that though so usually I keep it to my private chats where people seem more willing to cop to that fact.

I know we disagree, I know you're going to continue disagreeing, I know I don't want to have the conversation.

> I know we disagree, I know you're going to continue disagreeing, I know I don't want to have the conversation.

Please consider not chiming in on the next article like this then. I think your attitude of (paraphrasing) "no good programmer would have made the costly mistake you shared, and articles about it aren't worthwhile" is super harmful to our industry. It's the polar opposite of the blameless postmortem approach I'm fond of.

This one in particular is not worthwhile on the front page of HN, that's my take. They're most definitely useful for beginners, or maybe people just learning about databases.

I'm not going to not post simply because you find it disagreeable, there are plenty of people here who seem to agree with me.

Blameless post mortems are great, for your team. I am not his team mate, and I don't really feel a kinship with every developer under the sun. And for what it's worth I don't blame this developer for anything. If anything I lament the institutions that failed them on the way to this point in time. To me this is a symptom of systemic rot.

> I'm not entirely sure why buzz around "developer learns basic knowledge" has this on the front page.

The problem is that in the old days, not knowing about indexes left you with an underperforming system or downtime. But in The Cloud™ it leaves you with an unreasonably huge bill and that somehow as an industry we're accepting this as normal.

Which really is a head scratcher. You'd figure especially as a startup seeing a 5k oopsie isn't really as acceptable. Mistakes do happen and I don't mean any shade to this particular person (they'll never make this mistake again) but as an industry the aggregate consequence of this is you have a lot of waste and stupid choices that then have to be cleaned up when more knowledgeable (read highly paid) people are introduced later on.

They'll have to clean up the mess which causes real business consequences that, and I've personally seen this, will directly impact bottom line and have no quick or easy solution to wiggle out of.

Maybe it's acceptable for products like this because the balance between good engineering and company health probably aren't as cut and clear but stuff like this always makes me sad because it's such low hanging fruit, it doesn't require any real effort, just basic curiosity around your job.

I have no problems with a developer doing a 5k oopsie with things like card processing or an area that has a legitimate potential for direct monetary losses (such as payment processing where a bug could allow customers to order goods without actually paying).

I have a problem with whoever looked at <insert your favorite on-prem RDBMS here> and said "nah, let's go with a cloud-based solution that charges per-query and gives us an essentially infinite financial liability".

It's not so clear cut. What's the cost of losing the entire on-prem database? Do you trust a company who hired a developer who didn't know about indexes to hire a rock solid DBA? And how much does that DBA cost?
> What's the cost of losing the entire on-prem database?

Backing up an on-prem DB doesn't require specialist DBA knowledge. Basic UNIX skills are enough. Not to mention, since you're not in the cloud, bandwidth or efficiency is not a concern - feel free to rsync your entire DB off to a backup server every 5 minutes.

> to hire a rock solid DBA? And how much does that DBA cost?

They didn't have a DBA here either, and this "cloud" didn't save them. But at least with an on-prem Postgres the worst they'd have is significantly reduced performance*, where as here they had a 5k bill.

*actually the price/performance ratio for bare-metal servers is so good that a 100$/month server would probably take their unindexed query and work totally fine (as a side effect of the entire working dataset being in the kernel's RAM-based IO cache).

Rsyncing the database won't work in many cases, this doesn't ensure your backup is consistent. That is really, really dangerous advice, especially as you might not notice this if you test the process while the database is idle.

For Postgres you either use the pg-dump command and backup the dump or you setup WAL archiving and save base backups and the WAL files as they are created.

This isn't rocket science, but you really should read the manual at least once before doing this. Just copying the files is not the right way to backup a database (unless you really know what you're doing and are ensuring consistency in some other ways).

HA/clustered/replicated DB setups are not rocket science. Backups are not rocket science. Losing an on-prem database irrevocably never happened for me in 20 years.
This is not a matter of hiring an elite DBA. This is a matter of reading the manual. Both indices and backups are right there in chapter 7 and 8 respectively. But that's something worth doing irrespectively if you are running your own database or using someone else's as-a-service.
There are also many options in between "cloud-based infinity scale" and "on-prem". You can use cloud services that abstract many day-to-day operational tasks of db management, but are still bound price-wise to your monthly instance costs.
You can test backups, that doesn't require much expertise, only effort.
The best technical people aren't always the best to start a business. The goal is to make money not have perfect code.
If someone comes to me and tries to sell me a service that can leave me with an infinite bill I'd look at them funny and walk away. But that's just me and maybe I just don't get it and I'm not "startuping" right.
Cool, and if this was a case of bike shedding over something that hadn't just cost that early stage startup 5 grand I'd agree with you.

However regurgitating a platitude that everyone, including myself, learned when we tried getting our first business off the ground doesn't add much value here.

Had this been a Database with 10million rows it would have cost them 50k, and this is incredibly basic programming knowledge.

Basic proficiency is a far cry from worrying about best technical talent and not a particularly egregious ask.

I've been involved in some terrible software behind reasonably successful businesses. I complained like the best of them having to clean up the horrible mess. But it worked they used their limited competence and limited funds then built something profitable.
Yes we really should not accept this. The ability to impose limits on spending is key to control an enterprise. Whole security certification guacamole is based on having established controls. But where the bit hits the fan control is absent.
Using money to solve business problems is good business sense, but only if that’s the best way to spend that money. I agree with you that the status quo is normal, but nonsensical.
> But in The Cloud™ it leaves you with an unreasonably huge bill and that somehow as an industry we're accepting this as normal.

No. Nobody finds that "normal", that's just untrue. It's even the _whole_ subject of this blogpost: the bill was not normal.

I don't disagree that some people are overrelying on cloud services, but that didn't become normality, it's still a beginner's mistake

> it's still a beginner's mistake

Absolutely, but previously that wouldn’t cost you €5000 extra.

I've been using relational databases for web apps for my entire career and probably would have made this same mistake if using PlanetScale for the first time.

The author had two misunderstandings:

1) An index isn't created automatically

2) You get billed for the number of rows scanned, not the number of rows returned

Even if I noticed #1, I probably wouldn't have guessed at #2 for the same reason as the author.

You are absolutely missing the point. The point is not about indexes or full table scans, but it's a about cloud providers who will charge you for every row "inspected" and how a full table scan might cost you $0.15 and it would add up. It's not about slow performance which you can diagnose and fix, it's about getting an unexpected $5k bill, which you can't fix.

And in the end, if the cloud provider wants to charge you for rows "inspected", this can't be buried in small print. That's unacceptable!

The billing must come with up-front, red capital letters warning, and must come with alerts when your bill is unexpectedly little high (higher than expected, not just 10x or 100x higher). It must automatically shut down the process, requiring the customer confirm they want proceed, that you actually want to spend all that money. And it must be on the cloud provider to detect billing anomalies and fully own them in case it goes the wrong way. This is the cloud "bill of rights" we need.

You'd be surprised and frustrated. If you ever see someone say "We hired Oracle consultants and they are miracle workers" or "NoSQL is sooo much faster than SQL" you can be pretty sure they missed databases 101 and the requirement to add indexes.
> I'm not entirely sure why buzz around "developer learns basic knowledge" has this on the front page.

Because it's a well written, humble account of learning from a mistake then using it as an opportunity to teach others to help them avoid the same mistake.

If anyone leads a team, I hope they might learn from this approach, rather than just bashing on people and implying they don't deserve any attention because they made a mistake a more experienced developer might have dodged.

Personally I find it trite but whatever floats your boat.
One of the best database habits I've ever developed is to run EXPLAIN on every query that I expect to run repeatedly, then sanity-check the output. It's very little effort, and has prevented so much hassle.
If we weren't using underwhelming ORM DSLs, I'd love to use/write a github bot that automatically runs EXPLAIN ANALYZE on queries updated in a PR and post the query plan!
You can do that still. Just a bit more work to evaluate the ORM layer first.
Seriously. Like, every junior dev has to learn DB indexing basics sometime, and apparently the other of this blog post just did. But really can’t understand why this article is getting voted to the top of HN.
What I gained from the article wasn't that the dev was unaware of indices, it's that he didn't realise indices were missing due to how planetscale's database disallows foreign keys.

I never worked with a database that doesn't have foreign keys and it's not unthinkable to forget when you do for the first time, that foreign keys were what created indexes for you automatically.

A little bit of planning could have prevented that though :/

they were using some kind of foreign keyless MySQLish whatever thing