Hacker News new | ask | show | jobs
by kelnos 570 days ago
It's a bummer that you've been downvoted, because it really does seem like people here have not operated databases at scale.

I will never claim that we were great at managing databases at Twilio, but often a schema change would take hours, days, or even a week or two to complete. We're taking about tables with hundreds of millions of rows, or more.

We'd start the change on a DB replica. When it would finish, we would have to wait for the replica to catch up with the primary. Then we would bring up new replicas, replicating from the replica with the new schema. Finally that replica would get promoted to primary, with all the old replicas (and the old primary, of course) removed from service, and the new replicas brought in.

Only then could we deploy code that was aware of and used the updated schema. The previous code of course had to ignore unknown columns, and if we ever wanted to drop a column, we had to first deploy code that would stop using that column. Any column type changes would need to be backwards-compatible. If that wasn't possible, we'd have to add a new column and backfill it. Adding indexes would usually be fine without preparatory code changes, but if we wanted to drop an index we'd first have to make sure there were no queries still depending on it.

Even for a "small" schema change that "only" took minutes or a few tens of seconds to complete, we'd still have to use this process. What, do you think we'd shut part or all of a real-time communications platform down while we do a schema change? Of course not.

The idea that the application could or should be in control of this process, or could always be in sync with the database when it came to its understanding of the schema, is impossibly unrealistic.

2 comments

Yep, sounds like we have similar experiences! I first had to start thinking about this stuff at Hootsuite, back in the exciting 1million+ DAU days a decade ago. Before then, to me databases were just a thing that got deployed along with the application, and deploys only happened on a Friday night so who cares about downtime? By the time anyone tries logging into the app on Monday morning, the code and database will all be up to date. Going to a place where deploys were happening constantly and nonzero downtime was unacceptable was eye-opening.

> The idea that the application could or should be in control of this process, or could always be in sync with the database when it came to its understanding of the schema, is impossibly unrealistic.

These days my attitude is to treat databases as a completely separate service from the application code, which they effectively are. They're on a different set of servers, and the interface they provide is the columns/tables/views/etc, accessed through SQL. So yeah, no breaking changes, and the only thing application code should care about is if the queries it tries to execute return the expected sets of data, not if the schema itself matches. And certainly not about things like views, triggers or indexes.

This does end up being more overhead than migrations alongside the application code, which I know a lot of developers prefer because they're easier to use, but the approach just doesn't work after a certain scale.

(to be clear, I still use Liquibase etc to manage migrations, the process for applying those changes is just completely separate from deploying application code.)

> These days my attitude is to treat databases as a completely separate service from the application code, which they effectively are. They're on a different set of servers, and the interface they provide is the columns/tables/views/etc, accessed through SQL.

I've never thought of it this way, but I think this is really smart. If I have a service that exposes a REST API, I can, say, add a new field to a JSON object that's returned from an API endpoint without telling clients about it. Those clients can update later in order to take advantage of the information returned in the new field.

Same thing with a database: I can add a new column, and clients can learn about the new column later in the future, no problem. The database schema is just a part of the database's API, and it can be evolved in a backwards-compatible manner just like any other API.

> to be clear, I still use Liquibase etc to manage migrations, the process for applying those changes is just completely separate from deploying application code.

Right, the schema needs to be managed and there needs to be a source of truth for it, with tooling to do migrations, but coupling that so closely with the application so the schema and application always must be in sync (like some others seem to think is the One True Way) is a mistake, and would be a complete non-starter for my past professional needs.

IMO its a similar situation to the discussion here:

https://capnproto.org/faq.html#how-do-i-make-a-field-require...

I had to realize that at least in start up world most (non db focused) devs thing they might not be experts in SQL but have a very solid understanding

... and then don't know about a lot of very fundamental important parts and are blissfully unaware about that, too.

And to be clear I'm not saying they don't remember the exact details of something.

What I mean they don't even know that there are things they have to look up, nor any experience or willingness to understand what they did wrong by consulting the official documentation instead of just randomly googling and trying out "solutions" until one seem to happen to work.

The most common example would be having so little understanding about transaction that they believe transactions are just magically fixing all race conditions, and then then being very surprised that they don't. Or believing that transactions in SQL are fundamentally broken after realizing that somehow their databases got corrupted.

And again I don't mean junior deves, but people with 10+ years of backend or "fullstack" experience, i.e. people which at least should know that when to consult documentation/lookup protections transactions provide etc.

I have seen more then one time a (final state of) the situation where people started with believing SQL transaction magically fix everything, then get "corrupted" data then blame SQL for being broken and move to NoSql.

The joke here is all the concurrency problem are very fundamental and independent of SQL vs. NoSQL.

And SQL often gives you more powerful/easy to use (at small scale) tools to enforce synchronization, but at a cost. While NoSQL often gives you harder to use primitives where you have to do much more outside of the database to guarantee correctness, but then at least you will more likely blame you code instead of the db for things not working.

The most ironic thing here is I'm not a db expert, I just know where my knowledge stops and where I can lookup the missing parts and can't even give you much tips about huge dbs in production luckily surprisingly many companies have comparatively "small" db needs.

And honest where I see race condition related issues in SQL quite often I'm rarely not seeing them in NoSQL code. Where this issues in SQL make me sad as they are often very avoidable in NoSQL I often feel like giving up in resignation.

Through that experience is for "smallish" databases not Twillo scale. But a surprising large amount of companies have surprisingly "smallish" databases. Like no joke I have seen companies being very vocal about their "huge database" and then you realize it's like 5GiB ;=)

Honestly I (metaphorically speaking) don't even want to know how db experts feel about this, I'm not a db expert and just have a solid enough foundation to know where my knowledge stops and when I have to look things up (which is all the time, because I'm not writing that much SQL).