Hacker News new | ask | show | jobs
Old, Good Database Design (relinx.io)
182 points by jelnur 2102 days ago
17 comments

Some people choose nosql alternatives because they've spent time analyzing the performance of a proper relational model and have determined that an RDBMS will generate too much overhead for their data load and consciously accept the tradeoffs involved in giving up automated referential integrity.

Most people, though, choose nosql alternatives because they're too lazy to learn how to model data.

I am forever grateful that I took a full semester of database design in my undergrad. This single skill has stood with me for my entire career so far and has enabled me to figure out the root cause of many production issues. Plus people really like it when you can answer ad-hoc questions like "what inspections are still open and when were they first opened".

If y'all can understand Angular / React / Vue there's no reason to not learn databases.

Could you suggest resources (books, articles, videos, moocs or others) to learn good database design. I am picking up skills about sql but want to better understand and learn about databases. As someone who doesnt have that background, a lot of the times I am just googling for stuff and just trying out bits and pieces.
Apparently there's a 4th edition coming, but if you like the form-factor of a book-length text I'd throw a recommendation to "Database Design for Mere Mortals"[1]. I read the 1st edition from the late 90's, but I'd imagine it's still just as good. It approaches database modeling from a practical non-technical perspective, and I found it helped me learn data modeling in a software-agnostic manner, and later to influence how I talked to non-technical audiences about data modeling. I'm really glad to have found it early in my career.

I will echo what others have said. Data modeling is a force-multiplier type of skill. Combine it with a reasonable understanding of SQL and you can return a lot of value very quickly.

[1] https://www.amazon.com/Database-Design-Mere-Mortals-Hands/dp...

Not required, but knowing a little relational algebra helps queries make a lot more sense. A lot of the features like constraints and indices are very thin veneer over the underlying concepts.

At the other end of the spectrum, once you can write a few basic queries check out something like SQL Murder Mystery https://mystery.knightlab.com/

This one is super-fun and lets you practice some basic to low-intermediate skills.

Fun fact: You can do relational algebra on plain text data (e.g., csv, as well as output of commands like ls, etc.) using only shell + piping + well-known shell tools.
The Art of Postgresql. As the title suggests, it's targeted for Postgresql. However it has a full section on data modeling, another section on data types. It's packed with lots of examples and reasoning. I feel it gives you a good mix of theoretical background and hands on immediate experience. You'll find your way around what is (in my worthless opinion) the best database engine, and the pgsql specific examples will mostly translate to other relational databases.

https://theartofpostgresql.com/

I'd like to know this as well. I think you'll just have to build things (potentially horribly) and fail.

I took three semesters of database (granted, baby database classes) and I still have no idea how you can do something pretty straightforward like creating a room reservation system.

If there is a reservation beginning at 10:15 AM and ending at 12:30 PM and someone tries to book a reservation from 10:00 AM to 10:30 AM, the transaction should fail.

and before someone screams db2! yes, db2 can. but then you'd have to use db2 https://www.ibm.com/support/knowledgecenter/SSEPGG_11.1.0/co...

Why is this so difficult...

The documentation for Postgresql range types describes how to do exactly this.

https://www.postgresql.org/docs/11/rangetypes.html#RANGETYPE...

Edit: and if you didn't want to use postgres, you could have "starttime" and "endtime" columns and reject any bad bookings with a before insert / before update trigger.

This approach is the best and works really well if you don't need to do a join on a related table to look up information. If you need to use data outside of the current table for exclusions/check constraints, you have to write a trigger function (as far as I know).

I had to solve this recently, where the actual start/end times were stored on a related table. I'm no SQL wizard, but I'd love to share my solution in case it helps others (it might be terrible).

Note: I changed the actual tables/domain to be generic, this is a poor example and it made more sense for my usecase, but this shows general approach.

  -- Let's pretend we have these tables (awful design, but for sake of example):
  -- room <-> reservation <-> reservation_info
  -- Where "reservation_info" has "start_time" and "end_time"
  CREATE FUNCTION check_for_overlapping_reservations()
    RETURNS trigger
    LANGUAGE plpgsql AS
  $$ BEGIN
    IF (
        -- Find the newly created reservation and join it with the info record to grab "start_time" and "end_time" for check below
        with this_reservation as (
              select * from reservation
              inner join reservation_info on reservation_info.id = reservation.reservation_info_id
              where reservation.id = NEW.id
          ), bookings_for_timerange as (
              -- Select every other reservation, where the reservation is happening in the same room
              select * from reservation as other_reservation, this_reservation
              inner join reservation_info as other_reservation_info
                on other_reservation_info.id = other_reservation.reservation_info_id
              where other_reservation.room_id = this_reservation.room_id AND
              -- And the timerange from start to end overlaps the newly created record
                  tstzrange(this_reservation.start_time, this_reservation.end_time) &&
                  tstzrange(other_reservation_info.start_time, other_reservation_info.end_time)
          -- Get a count of all the records, it should only be 1. If it's greater than one, there's overlap.
          select count(*) from bookings_for_timerange
        ) > 1
    THEN
        RAISE EXCEPTION 'Room is already reserved during this time period';
    END IF;
    RETURN NEW;
  END;$$;
hold start & end time in normal DB columns, use business logic for your rule enforcement, have a query to identify rule violation. Maybe add constraints on your warehouse data but I probably wouldn't on the operational DB. I could be convinced to do so for the PoC because it might make sense, but what if you're managing thousands or millions of scheduled "resources" like a chain of libraries, or hotel rooms? What if you need to coordinate "leases"? DB-level is an OK place to start, but also one of the hardest to tease apart when you need to scale.

This is kind of the point of data modeling; you start with an idea, make it all pretty 3rd-normal form, define your projected indices and checks and constraints, then you mess it up a bit where it makes sense or you've had experience in the past.

YMMV

Exclude constraint on a GiST index?

https://stackoverflow.com/a/51247705

Just about anything written by C. J. Date would be worth your while in this regard. It will give you a solid grounding in the relational model, which you can then apply to any RDBMS. Here's a good list to start with:

- Database in Depth, O'Reily (2005)

- Relational Theory for Computer Professionals, O'Reilly (2013)

- SQL and Relational Theory, 3rd Ed, O'Reily (2015)

- Database Design and Relational Theory, 2nd Ed, Apress (2019)

If you can find a copy of this, I'd recommend it "Conceptual Schema and Relational Database Design: A Fact Oriented Approach" by Nijssen and Halpin. They presented a richer model (NIAM) for designing databases than the ER model by Codd/Date, though CJ Dates book is useful to. Both of these are out of print now, but anything by Halpin or Date would be a useful addition. To design databases relational theory and normal forms are the language you'd need to understand - they're not very hard though seems they aren't as widely taught any more.
Pardon my ignorance--is inspections some concept that relates to database management, or are you referring to a query like "select * from inspections where status = 'open';". Honestly asking.
I believe OP is just referring to the ability to run adhoc queries. Typically nosql solutions are built to be performt for common access patterns, at the expense of being difficult or impossible to query in unplanned ways. SQL DBs are very good in that regard.
Thanks! And yes, I agree. The ability to write ad hoc SQL is a great skill.
Totally. I’ve always admired SQL wizards and their ability to (quickly) produce the most optimal queries given the relatively limited syntax (compared to what you can do with frameworks like, for example, LINQ today).
Ah yes, inspections in this context means Safety Inspections, aka a business rule or thing. Industry is Retail / Grocery. Inspectors inspected food at stores on day one and Stores had 5 business days to “complete” the inspection. Internal audit wanted a list of “defaulted” inspections so they could go harass the store people to get their things done.
Absolutely agree. Learning data normalisation was up there with Typing 9 (touch-typing class in grade 9, not Hindley-Milner whatever) in terms of long-term usefulness.
I feel the same way. It’s been nearly 30 years since my undergrad database class and I used what I learned in that class yesterday mocking out a schema for a personal project. We didn’t touch any real dbms in the class but rather did most of the class work using relational algebra.
Is "inspections" a typo for "transactions" here?
I speak as someone who's worked for 30+ years on data modelling. Every time I encounter some mongo or other non-relational DB where the company jewels (the data) are stored with no documentation, no data model etc and stuff is just shoved into these stores willy nilly it makes me weep.

Start off with relational, if perf is a problem then look at denormalising, after that then consider other alternatives for special cases. But to see run-of-the-mill apps with no near future scalability issues jumping right into mongo et al from day one makes me want to run away.

> have determined that an RDBMS will generate too much overhead

I think the read/write overhead is mostly a function of schema design, rather than an intrinsic property of an RDBMS. Denormalized schemas have similar performance profiles to document-oriented storage.

Moreover, mainstream SQL databases like Postgres are getting better and better at indexing jsonb fields, indexing time series data with BRIN, rudimentary full-text search, offering a one-size-fits-all storage system that may not be the best at everything, but will be good enough to support a growing business to its next funding round.

for text search the rum index is quite robust. my guess is that rum (or something like it) will be introduced into the pg core soon. we index many terabytes of pdf files excellent performance.

   https://github.com/postgrespro/rum
also, postgrespro are behind the json/b indexing.
Those two aren't mutually exclusive.

If you don't know how to put in the hard work to get something like you want 'the right way' and someone keeps advertising that they have 'another right way' that can do all the same things, you can take the low road and by the time you see the trap you've walked into it's difficult to go back, or for some people even admit that mistakes were made.

There are all of these time horizons in Software Engineering where you externalize the costs of things to your successors. Several classes of bad decisions start to fall apart at 18 months, and unless those started at the same time you did, you might be able to grin and bear it until it's time for a new job anyway. I think there are classes of process or architecture missteps that might go 3 years (less confident of that time interval) before it comes off the rails, and for a lot of startups that could be the proximal reason you flame out, even though it's the symptoms that are more visible.

Or, like myself, they chose NoSQL because it was much better integrated into their development and production cloud infrastructure than any alternatives.
Example or it didn't happen
AWS appsync, firebase, etc. Both nudge you to use their proprietary nosql database through integrations.
I think relational databases have largely failed developers because they don't provide the features they actually need.

A common question that comes up is how to do zero-downtime schema changes. The answer is that there isn't one. A correct implementation would store each schema version in the database and when an application connects, it would specify which version it's speaking. The developer would supply a mapping on how to make vX data available to a vY program. But no relational database supports such a feature, so people are forced to tread carefully -- look at all the deployment software that exists to attempt to find changes with database migrations and treat them differently. Look at all the software people have written to even apply those migrations. It's staggering, all because in the 70s when these systems were designed, the thought of deploying your code multiple times a day was unheard of.

Another problem that comes up is transactional isolation. Most engineers, and even casual practitioners, "know" that transactions exist for cases where you want to perform multiple operations atomically. But very few of these people are running the transaction with an isolation level that provides those guarantees. They will write their program assuming that transactions are strictly serializable, but in fact they are using "read committed" or some other weak form of isolation that totally breaks their assumptions. Then the database gets into a weird state, and people are baffled as to how that could happen. The actual implementation is so different from the CS assumptions that it's not even something that crosses people's minds, and "read committed" behaving as "read committed" looks like a heisenbug in the rare case they actually notice what's going on. That's super bad.

These underlying problems have nothing to do with SQL or NoSQL, though. NoSQL can smooth over schema incompatibilities a bit (perhaps the schema specifies everything as a "field tag" instead of a name, so you can safely rename columns, or perhaps everything is "optional", so the application can detect that it's reading an old record when it's missing; you can also easily build your own versioning system on top because the data doesn't mean anything to the database engine itself), but you can still get yourself into a lot of trouble. The NoSQL databases also have a horrifying transactional cleanliness record. Postgres may be "read committed" by default but at least you can get real transactions if you ask for them; good luck ever getting them with some NoSQL databases.

I guess where I'm going with this is that database engines are focused on the wrong problems. The relational model is very good. But it's something you can bring yourself once you have a way to transactionally read and write keys with opaque values. You can also add indexing at the application layer, or triggers, or encryption, or auditing, or RBAC... whatever, it's just code. At the end of the day, picking a relational database just gets you a VERY opinionated set of defaults that is unlikely to be what your application needs and nearly impossible to change later... but the defaults are juuuuust good enough that nobody makes a real effort to change them. Meanwhile, we ignore the problems that actually plague developers; schema versioning, unusual data types (time series, large blobs), availability, replication, etc. The operational concerns have been ignored for decades, and it's slowing everyone down.

People are right to be looking for alternatives, even though we know that most of the alternatives have even worse problems. Someday, somewhere, someone will get it right.

Good post. I think it is also worth mentioning that the relational model has a few intrinsic performance issues and gotcha's that require some uncomfortable and creative wrangling.

Specifically, safely handling concurrent writes to different rows that are related to each other requires careful consideration, and when you have a join with clauses on both tables it is not possible to have perfect efficiency without multi-table indexes.

I think that these are not intuitive, but solvable. The troubles arise when an application doesn't consider these issues until after it hits scale (which there is sufficient concurrency and enough millions of rows for these problems to rear their heads).

All depends what you're doing, right? Is it a small stupid app with a short lifespan? Is it "just" a proof-of-concept? Is the company likely to survive long enough that technical debt matters? (oblig. HN/SV/startup comment) Is there a ridiculous imbalance between data ingestion volume vs. reads/retrieval? All good arguments for alternative/noSQL approaches.
Yes. And then you‘re app grows and now you have an undocumented NoSQL database that would better have been a RDMS...

There‘s nothing wrong with starting the proper way.

I am imagining an industry where half of developers don't know how to design a relational data model. Scary if true.
I'd be surprised if it were only half.
If one of the purposes of relational databases is data modeling, I've always wondered why there aren't good semantics for sum types. The real world is full of them, but databases can't express them. When I bring this up, some people respond that this is the purpose of ORMs; however, this implies that we have an arbitrary bifurcation in which some of the processing happens efficiently in SQL and anything that depends on sum types has to get hoisted up and over a network to application code. Further, it allows for different clients to behave differently, possibly having different notions of what any given sum type's variants are (which leads invariably to data corruption). I really wish databases did better here, but maybe I'm missing something.
There's no real reason you can't have sum types in a relational database, this just goes back to our current crop of RDBMSs being quite old and predating the current emphasis on strong typing. I suspect a modern relational database with an easier way define complex types (not writing a plugin) would be very popular, but the amount of work required to make something like that is immense
That matches my intuition. The more surprising thing is that this doesn't seem to be talked about very much. I've literally never heard anyone else advocate for a relational database with support for sum types (I'm sure they exist) which has caused a lot of introspection about whether or not there's something about RDBMSs that I just don't understand.
it came and went, postgres and sql server were (and still are) referred to as "object-relational" because of their support for rich user-defined complex types. it's just not that practical real-world because we tend to start with first normal form which complex types violate naturally.

> whether or not there's something about RDBMSs that I just don't understand

perhaps if you consider the entire result of some query to imply the type of the thing being modeled rather than any given table or cell, then it's more obvious that the basic types supported aren't much of a limitation? afterall you can always include a "type" indicator (or project one from a more sophisticated relational model).

Normal form supports complex product types just fine. A row is an object and a row containing a reference to a row in another table is equivalent to an object containing another object, and of course a result set is a list.

But I’m not talking about product types, complex or otherwise. I’m talking about sum types (also known as Algebraic Data Types).

> A row is an object and a row containing a reference to a row in another table is equivalent to an object containing another object

if you're already thinking of it that way then it's a small jump to the result set as a set, that can participate in a union - and there's your sum type. add or project a type indicator and it's literally a tagged union.

Three techniques for sum types in SQL (absorption, separation, and partition):

https://www.parsonsmatt.org/2019/03/19/sum_types_in_sql.html

Also, here is a relational database system with native support for sum types (and also no "NULL" nonsense, which is also not part of pure relational algebra):

https://github.com/agentm/project-m36

Oh, very cool. I'll check those out. Thanks for sharing!
I tend to agree. Algebraic data types (product and sum types) is essential for modeling. Missing sum type would be incomplete in terms of modeling.

A lot of obscure OO code I have seen is related to modeling sum types of products (or parameterized enumerated in some language), or the lack of it in mainstream languages.

The problem applies to SQL as well. If I make everything into the same table and leave many columns empty I’m not modeling at all. But if I split them into tables it defeat most basic SQL functionalities like SELECT * FROM my_sum_type.

The traditional relational model is very focused on mutable data and normalization. Different types would be categorized in separate columns. So this idea would run counter to "best practice" and need something foundational behind it, which would be just enough out of scope for a traditionally typed relational datastore. Maybe this is just another way of saying the underlying theories are different, or covering different areas of computation and storage. So the prevailing wisdom is to not expose type complexity explicitly to clients, but just export the inherent problem implicitly.
It seems that the relational model plainly enough wants to be the gate keeper for your data model—it gives extensive tools for modeling and enforcing data schema, but it just kind of throws its arms up at data that is “OR” shaped. Some people argue that it’s because there’s not an obvious way to lay out sum type data in memory or on disk or on the wire, but these problems are all solved by traditional programming languages (data is data, at the end of the day). If you want to take the “different philosophies” tack, then it seems like a philosophy that only addresses AND-shaped data leaves a lot to be desired.
I'm interested in understanding what you mean. What is "OR" shaped data? Are you thinking of data like, "The staff member must have either a salary or an hourly rate"?

Typically I would see this modelled with two db columns, with a DB constraint indicating that only one of these can have a value.

That’s generally what I’m talking about, and it works well enough for simple cases, but these sorts of solutions tend not to scale well and you give up type safety. Not the end of the world, just disappointing.
Sum types are harder to express as data integrity constraints that product types - for example, Horn clause constraints can be used to express that a table is a product (or limit, more generally) of other tables, but to express that a table is a co-product (or colimit, more generally) requires a stronger logic (handling disjunction, for example). This has implications for data processing algorithms such as the chase that depend on the strength underling logic.
Can you give an example of real world data modeling where you want more expressive sum types over just using enums? Enums are technically a subclass of sum types, but even those are non-trivial to use at a data format level (Try evolving them in an on-the-wire message format like Avro or Protobuf).
Imagine a system that allows third-party login (Facebook / AppleID / whatever) - then the account has either a username/password hash or an oauth token or some other kind of structured data.

Delivery addresses for a system that supports both physical and digital products - you want a type-level distinction between physical and digital addresses, but an order might be being shipped to either.

Subscription vs free trial - they're different kinds of thing, but you want to store more details (e.g. expiry date) than just an enum of one or the other.

>Imagine a system that allows third-party login (Facebook / AppleID / whatever) - then the account has either a username/password hash or an oauth token or some other kind of structured data.

What I've seen most often is you have to deal with account merging but lets say you do really want either/or...

Wouldn't you just have your third party tables (each with their own idiosyncrasies) and in your user table you'd have login_type and login_id columns? You know which table to hit by type using the id?

> Wouldn't you just have your third party tables (each with their own idiosyncrasies) and in your user table you'd have login_type and login_id columns? You know which table to hit by type using the id?

You can do that but it's a bodge. E.g. you won't be able to have the foreign key constraints you'd normally have on that login_id column. And good luck writing a query that actually does something differently for each case - you'd have to do something like multiple left joins and there's no way to check you've handled all the cases and not done one of them twice.

You can have a table or view of all the ids across your implementations and FK into that.

As for how you would model it in the application, in this case you can just normalize across all the possible columns and have the orm build out your mapped object.

How do you model "postal address"? Some postal addresses are PO Boxes, some are street addresses, etc. There are canonical representations of these different cases. Do we just shove it all in a string, and let the application perform domain validation?
Don't even try to do validation on postal addresses. The postal system has so many corner cases that you'll never be able to correctly handle all of them. Every mishandled corner case will cost you, or your counter-party, time and money.

Just dump addresses into a unicode string and let the postal system figure it out.

postal address is one of those cases where you probably do just want to shove it all in a string as most structural constraints eventually backfire - especially if you support international: http://www.columbia.edu/~fdc/postal/

the most common schema I've seen is usually something like line1, line2, line3, city, state, country, zip, etc. if it's a reporting database then city/state/country/zip is often mashed into some sort of location id.

Watch out for `state` as well, please don't make it mandatory like so many websites. No, a region is not the equivalent of a state in France, you don't need it for my package to get there!
Each type of postal address is a separate column. New postal address "types" would get new columns. This works particularly well when addresses can have both PO boxes as well as street addresses. This is actually more flexible than tagged unions/sum types, at least for this particular case.
So a PO Box address or a house address would be concatenated into a string value and then stored in either Addresses.POBoxAdress or Addresses.House? It’s still not structured. How can someone easily get the postal-code/zip-code?
I think they may mean that the result set has elements of different types. For instance if you stored restaurants by genre but wanted a list of all restaurants, but retaining all of the unique fields, you currently need to generate the product type of the genres.
Hmm, is it that hard?

    Vehicle table -- ID, TypeId, Make, etc.  (123, 456, ...)
    TypeId table -- ID, Type (456, motorcycle)
    Motorcycle table -- ID, HandleBarStyle, etc. (456, Low Rider, ...)
    Automobile table -- ID, TrunkSpace, etc. (789, ...)
You can pretty easily add extra information to any id as long as you know where to look, and that can be a simple enum column to define the concrete type (and thus what data to grab). Its an easy enough join, isn't it?

The data modelling isn't the hard part really. Pulling it into an application in a nice way is probably harder. I don't often use languages with sum types professionally so maybe this is way off base but I don't see an issue.

How does the vehicle table ID column enforce referential integrity with the primary keys of the other columns? In general, there are a lot of issues that come up with respect to type safety and query semantics when you use these workarounds. They’re okay, but disappointing.
You can enforce that the column is non-null and has a valid value id in the relation table. You can also enforce that your motorcycle and auto tables are referenced in the relation table or make it a view of all the ids across the types (conceptually anyway, I'd have to look into the perf). It's a contrived example but what is the issue? What are the drawbacks? These are pretty easy constraints to add.
If you drop a row from the motorcycle table without CASCADE does it error if the vehicles table still contains a pointer? If you drop it with CASCADE does it also drop it from the vehicles table?

Honestly it’s been several years since I’ve done standard app development, so I’ve forgotten some of the specific problems that I would run into. I just remember that I tried all of the gimmicks to emulate sum types and they were always clumsy. Not a very satisfying rebuttal, I know.

What's difficult about them? I typically use nullable columns and then a check constraint to specify a custom condition for nullability. Columns belonging to the same alternative in the sum type must be all null or all not null. And then there's check only one active alternative.
This approach doesn’t scale when business requirements change regularly such that you need to add or remove columns to an existing table.

Adding new columns by creating a new table is easy and cheap and doesn’t involve downtime. Adding 100+ columns to an existing table because the spec said a relationship went from 1:0-1 to 1:1 is a pain.

(This can be avoided with creative design with deferrable constraints, something certain major RDBMS are still lacking, gah!)

> Adding 100+ columns to an existing table

Whaaaa 100+ columns? How often do you need to add 100 columns to an existing table?

I mean honestly that has to be as rare as chicken teeth.

>>Whaaaa 100+ columns? How often do you need to add 100 columns to an existing table? >>I mean honestly that has to be as rare as chicken teeth.

Haha, and yet it happens. One of the worst databases I ever dealt with was some ridiculous variant of Universal Model for a SCADA application, the databases were named D1, D2, D3, and so on; the tables T1, T2, T3, etc; the columns C1, C2, C3, you get the picture I hope...when they hit maximum column count they would create a new table and bizarrely rebalance renumbered columns between them. In database D0 there was a catalog with all of the details relating back to the actual model. The system was actually implemented at some defense contractors and power system operators, I remember someone at Lockheed whining about hitting the number of databases limit for Sybase.

Hopefully nobody has to deal with that crap any more.

Oof I feel your pain
Whoah I've never dealt with tables with more than 20 columns. Can't imagine the shops that require 100+ columns.
Type safety And query semantics, mostly. You can use constraints to get some of this back, but it only goes so far. Ultimately there’s a reason statically typed programming languages developed sum types, and all of those reasons apply to databases as well because data is data.
you can avoid nullable cols by using separate physical tables for each concrete type with surrogate PKs from a common sequence. these can be concatenated back together in a view (with no performance hit if you're careful not to hide indexes and predicate pushdowns). FKs then can reference the appropriate concrete table and in case you need an FK on some union you can use an indexed view instead.

I've used this approach for modeling entities that are polymorphic in object-land and haven't felt particularly underequipped - more robust indexed view support would be nice but that's active research territory (one of the hardest problems in computer science, materialize.io looks promising!)

> If one of the purposes of relational databases is data modeling

Huh?

Creating a data model before creating a database, is like writing an outline before writing an essay. It organizes your thoughts and gives structure to what you are about to do. Once you have a data model, you can then implement it using whatever database technology you choose.

If you don't start with a data model, you literally don't know what you are doing.

I think you misunderstood my comment. I’m arguing that relational databases would be more useful if they had sum-type semantics. Data models often have OR-shaped data, and pretending like this class of data doesn’t exist and making it the purview off application code makes relational databases much less useful than they might otherwise be.
> A well-thought design can save us many hours of coding, testing, and troubleshooting.

That is the very definition of a waterfall design model.

I've turned into a fluid-design advocate over the years, where every design principle follows a next question - "okay, this is good but how would I change it?".

So you start with a unique constraint and four months later, you find out that it is not actually unique (like "two patients with the same email, because the 2nd one is a newborn on day of birth").

Or you normalize a data-set only to find out that your 1:N relationship turns into a 2:N relationship from before/after dates (like "UK" goes from "EU" to "UK").

The lost-time work of a design is usually the "okay, we did it in a world where UK was in EU - but we undo it & here's how" notes.

Having a plan to decommission the nuclear plant you're building is super useful and often more relevant than designing it for efficiency alone.

Good design as advocated by the author is generally conducive to change and is saying pretty much the same thing as you are.

It's a lot easier to take away unique constraints later on instead of adding them in. It's easier to de-normalize some data for performance than to normalize it later on. The list goes on.

The reason the waterfall method received so much bad press is because of requirements gathering, not the software or data design phases. Requirements are hard to get right the first time and they also change over time. But I'd be surprised to find someone argue that good architecture and design is a bad thing (being defined as the ability to adapt to changes in requirements).

> The reason the waterfall method received so much bad press is because of requirements gathering, not the software or data design phases.

No, it was because all three were done wrong.

Requirements gathering is the biggest problem, true. But even if requirements were both knowable and fixed, for most projects, big up front requirements gathering, design, and then implementation would have lots of waste in the lean sense of effort expended that spends time not delivering customer value.

Now, that gets made worse with the rework created by the fact that requirements gathering without validation by use gets lots of stuff wrong and that the context is often evolving such that requirements will drift between gathering, design, and implementation in a waterfall project, so that lots of work is done which never delivers value and needs reworked before it can do so, but the problem exists even without that exacerbation.

Well-thought design saves time even in agile projects. Maybe it's a one day activity instead of three months of it and yet it makes a difference. Basically each activity (or whatever we call it) is a micro waterfall.

Example from today: a developer came back from a week of vacation, listened to the stand up meeting this morning and pointed out that we misunderstood the purpose of a table he worked on time ago. Result: a few hours of last week's coding were useless and we spent a couple of hours together at redesigning the activity. The total impact should be of about one day.

> That is the very definition of a waterfall design model.

I think you're conflating two different things. I've spent the last two days thinking about the impact of adding five new tables to our database for a feature, two of them just lookups. I've thought about the short term benefits and the long term possible problems (for something that isn't even spec'd yet).

This isn't "waterfall" this is just getting a design about right for the current circumstances.

If during the design phase you're already asking "okay, this is good but how would I change it?" then you're already waterfalling your design.

Being thoughtful about your design up front (waterfall as your claim) hopefully solves many a problem down the line when you've suddenly got 5900M+ rows of data in the wrong shape in your production database; because a production database of that size doesn't take kindly to being "agiled" around.

When you need to add info to your database, do you refactor existing tables or do you add a key-value table.

I've seen so many key-value tables that really needed to be refactored into proper normalized tables.

It's just so tempting to stick extra info into KV when you don't know where the project will end up.

> "okay, this is good but how would I change it?"

Otherwise known as Reversible Decisions. Any decision that can be undone easily does not require the level of scrutiny, the level of investment. Save that investment for the things you won't be able to change.

When faced with an irreversible decision, it's helpful to develop stalling tactics. Everything from distracting people with other issues to finding a way to get a 'taste' of the change without committing to it. It also helps if you stay on top of release notes for tools you use, and competitors of those tools. New opportunities might arise to use someone else's work to solve your problem better.

Are you presuming that a 'well-thought [out] design' means that all the thoughts, and the design, came only at first? I can see a well-thought out design either as the waterfall you presume, or as an accretion of design decisions made over time as in your 'fluid design.' Either way can save many hours of coding, testing, troubleshooting.
> That is the very definition of a waterfall design model.

No it doesn't mean that. When you can rely on your data it's easies to code and test than when you have to account for every bad data combination.

And remove the constraint is easier when you need it, than to add it later.

Thanks for the feedback!

It's easier to safely loosen constraints than to add them post facto.
Agile needs design too. Agile does not mean you don't do any up front work or any design work, it just means you have agility to adapt to change as you progress. A well thought out design often helps you be more agile, not less.
My least favorite part of database design is the bit where you have to pick lengths for your char columns.

Twenty years in and I'm still picking these pretty much by guessing. And when I guess wrong it causes really annoying problems further down the line.

I love how SQLite doesn't make me do this - it just has a TEXT type which is always unlimited in length.

I suppose that in every RDBMs that makes the distinction, it's an optimization matter - VARCHAR being stored on-page, while TEXT off-page (although there can be optimizations for short TEXT values); the latter will cause an extra page seek on access.

Some database [versions] may also be unable to apply certain optimizations in certain cases, when TEXT is used (eg. temp tables on MySQL <= 5.7).

That doesn't prevent one from always using TEXT, and possibly, for most of the use cases (surely, if one uses SQLite, that's the case), the performance impact is not meaningful.

For what I understand in the docs, in postgres at least, VARCHAR and TEXT are the same thing, and CHAR actually has the performance hit.
They are implemented the same, but they aren't the same type. Notably, each parameterization of varchar (or char) is its own type. This can cause issues when trying to change the length parameter. For this reason I prefer to use TEXT with a CHECK constraint.
What's wrong with TEXT type for postgres, mysql, etc? In Postgres you don't need to declare a length for varchar either.
Yeah, in fact, Postgres encourages the use of TEXT over VARCHAR. The documentation even states:

There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column.

https://www.postgresql.org/docs/current/datatype-character.h...

Postgres also has a TEXT type like this.

Edit: I see mpolun left a similar comment here, but it looks like he has been mostly shadowbanned for about 8 years.

> Edit: I see mpolun left a similar comment here, but it looks like he has been mostly shadowbanned for about 8 years.

Yeah same. I don't have enough karma to vouch for their comment but I see no big reason that said comment should be dead. Their account is from 2012 and the vast majority of their few comments seem to be contributing to the conversation. Their first dead comment is also from 2012 but at a glance said comment is the only one that stands out as not contributing much to the conversation. And that's being harsh even – I've probably made less substantial comments in the past myself. Yet like 30% of their 3 pages of comments are dead. And looking at their submissions they have ever only submitted 3 stories, 2 of which appear to be from a domain that they themselves control. Hardly enough to be subjected to having so many of their comments killed I think. Though of course there might be other factors at play, but from what I see on their profile page I see nothing bad enough to warrant this.

I have 1500 karma, but I didn't see an option to vouch for it... I have seen that option on other comments, though. Maybe dang can take a look at their account.
Just for general interest, how can you see their comment or assert they have been shadowbanned?
> how can you see their comment

Go to your profile and ensure you have showdead set to yes.

> assert they have been shadowbanned

See https://news.ycombinator.com/threads?id=mpolun and check out how many of their comments are showing as dead in combination with the contents of said comments. (The step above about setting showdead to yes might be required before you follow said link in order to actually see the dead comments). Almost every single one of the dead comments is contributing to the conversation. This is indicative of a shadow ban. HN users would not be downvoting the vast majority of these comments I think.

There's always a limit. You either define and manage it yourself or it'll be done for you when some part of your system breaks. In the end if you actually need performance and reliability everything will have a bound (if not fixed) size and larger data will be processed as a stream, anyway.
Postgres tip: define columns as TEXT, but with a CHECK constraint that the length is what you expect. This avoids the problem with varchar(...) that views inherit the underlying column types (including the length specifier), and prevent you from changing the maximum length in the table unless you drop and recreate the view. (There are ways around this through system tables but they are wholly unwieldy.)
I really like that - especially since changing CHECK constraints can be done on large tables without having to rebuild the entire table.
Just use TEXT in other databases as well. It really shouldn't matter much in modern dbs
Why don’t you use varchar(max) as the range always. The varchar data type specified that the length of this attribute is variable in each record and the memory allocated depends only on the number of actual characters stored in the column.
That’s a bad idea: that pushes the burden of data validation entirely on your client or application code. Textual column lengths should be used to enforce sanity checks on data.

I’ve worked on more projects than I care for which had nvarchar(max) columns for storing the contents of a small 3-4 line HTML textarea: most users were expected to type in less than 100 words or copy-paste the output of another program.

One day, that other program had a bug that made it generate about a gigabyte of textual output. That program had a “Copy output” button so the user didn’t realise how much data they were copying. I don’t know how it didn’t timeout when it was inserted, but that user brought the system down for everyone because that gigabyte-sized text value was used in lots of places.

>>That’s a bad idea: that pushes the burden of data validation entirely on your client or application code. Textual column lengths should be used to enforce sanity checks on data.

It just doesn't work out that way in practice. For example SQLite, the most popular RDBMS of all time, pisses on strict column type and gives you value type instead with hints and storage classes. For several large systems I architected in SQL Server and Oracle, I gave developers heuristics to follow for column type selection and in some cases strictly re

I compare it to "defence-in-depth" (i.e. data validation at every level).

Ever since RDMBS moved-away from XBase-style table-files database textual/string column length limits are conceptually the same thing as having a CHECK CONSTRAINT on the length of a string (something that SQLite does support!).

e.g. https://stackoverflow.com/questions/8252875/how-to-restrict-...

I think the CHECK CONSTRAINT is explicit and intentional WRT data validation. However it does not make a different type out of the value. I don't really think there is a one-size-fits-all for this problem anyway. The real problem expressed is accepting too much data - a check constraint requires the entire value to be present before validation. So if too much data shows up and there is a weak link in the data processing system, T.U. as they say...Many libraries provide for controlling how much physical data can be sent without resorting to a large data API.

Most character values in a database are not really text entered by a person, it's typically a symbol for some kind of enumeration or key referring to external semantics, e.g. compass direction "N", "NNE", etc. Or street names or postal codes or country names and so on. You can either model this directly or if it makes no difference, accept whatever you get. Actually examining the data can be interesting; I noticed once in a database with 50 billion stored dollar amounts there were only about 50,000 unique values. This led me to assert the dollar amounts were actually symbols in the system.

Sorry, my comment was truncated in my original post and I didn't notice it in time to edit it.

Varchar(max) is fine for some cases but you can't use it for everything because there's some limitations compared to varchar(1000) because the latter is stored in the actual table structure while the former is not.
With postgress it doesn't matter. Chars are just varcars under the hood, so you should always use the latter.
Until some user inserts 100MB in a single column of a single row.
I'd rather guard against that in my input validation application logic than pick the wrong value when I'm designing my table and have to deal with fixing that later on.
Don't the other ones have TEXT, too?
Same with postgres. I never use specific lengths for text on postgres.
Haven't you seen ghostbusters? "Don't cross the streams. It would be bad." If your data model has lines crossing over, it's the first smell of a bad design. Seriously tho, 5 minutes of untangling would make that data model diagram 100x better.
I wonder if there’s actually any significance to if your data model forms a planar graph
Reminds me of this great Derek Sivers post: https://sive.rs/pg

If your design is good, you need less code.

>If your design is good,

"Draw the rest of the fucking owl"

Good design is hard. There are arguments to be made for both, but the problem with "Old, Good Database Design" is when the design changes it either devolves into

1. Downtime trying to move X billion rows

2. Some ad-hoc K/V store on top of your RDBMS

And most companies tend to opt for (2) rather than (1). It's no surprise that some systems just decide to choose (2) from the onset.

Problem is that any changes to the databases need to be done on a single point of failure for your application. Any change that goes sideways, you risk downtime.

Counter this with code, where I can deploy code along side my existing code and make sure it functions as I want it. If something fails, I just remove that instance from the LB. You don't want to touch something that valuable during regular feature cycles. Databases should be altered rarely and with much apprehension and a well established backup/rollback plan.

I'm not saying you can't use constraints and stuff, but they should be really really static concepts that aren't subject to change. Unlike code, you can't share validation conditions across tables, so where you could update all the validation logic in a service with a change in one place, you have to update many tables in the database.

I would stay away from triggers entirely, and use a queue/stream system to process data async. This can be better prioritized when the DB is under heavy load.

I'm wondering what you guys think about columnar databases and wide tables.

We use Vertica and from senior colleagues and even Vertica developers I got the impression that big wide tables are good because it eliminates the needs of join. Thus we don't use star schema and just wide tables.

However I think data modelling is also about embedding proper business logic and it would be a lot more confusing if two unrelated columns are stuffed in the same table.

For example let's say my event based game has a base for each player and he can do a lot of things in his base. If we use a wide table then we wilk see columns related to factory building side by side with pig feeding, and because each small feature has some unique columns, a lot of those columns are NULL simply because this event json doesn't even have th fields.

I'm wondering if we should use Vertica for a transactional type table and then use say sql server for dwh and build more traditional data modelling. But this could be awfully wrong maybe...

A few comments based on lots and lots of experience:

- Wide tables in columnar DBs can make some analytics queries easier to write and sometimes more performant.

- Wide tables can come at high storage cost and make other queries less performant (like SELECT *)

- How much of these things happen is extremely DB dependent. How does the DB's underlying storage mechanism work? How is the data partitioned and distributed? How sophisticated and storage-aware is the query planner? How large is your data? How fast is your data growing? How fast do you need a new data point to be reflected in your dashboard?

There's no free lunch when it comes to this stuff. A perfect solution doesn't exist, but the technology is getting better all the time. I've personally never used sql server as a data warehouse but plenty of folks do.

The stuff I use that I recommend very highly - Snowflake, TimescaleDB, vanilla Postgres. Also, use dbt.

Thanks teej for the answer.

>How much of these things happen is extremely DB dependent. How does the DB's underlying storage mechanism work? How is the data partitioned and distributed? How sophisticated and storage-aware is the query planner? How large is your data? How fast is your data growing? How fast do you need a new data point to be reflected in your dashboard?

I think most of my frustration comes from not knowing these stuffs. I work as a BA-BI hybrid as I found my data analysis skills are good complements to data modelling/airflow type of work, so I persuaded my manager to give me some BI work. But that also means I don't have the technical knowledge such as DB internals (and TBH I can't even find a book for Vertica on that matter).

Our DB and DBAs and all ETL processes are located in HQ and we actually don't own our databases. This, I guess, adds an extra layer of discomfirt as we are effectively cut off from database-level optimization. Our data engineer is about to leave because he has no DE work to do (every ETL has to go through HQ's process and we only need to submit some configuration files).

We don't have access to the databases you recommended (again HQ controls that), but I do believe I could try DBT, may I ask how do you use it? From my understanding it is mostly a transformation tool, but what makes it stand out?

No amount of technology can fix a dysfunctional org.

A large amount of the work of BI is getting data into the right shape in order to ask questions about it.

- What’s level on average do new players drop off?

- How are sales of the new skins we introduced performing vs the last set?

- How often are players engaging with feature A vs B?

You might be able to answer all of these questions with just event data in a wide table, but you’ll write very different queries to answer them. dbt is a way to write queries on your source data that leads to systematic, repeatable, and reusable new tables for answering analytics questions.

No. Vertica excels at being a column store database - a data model fit for analytics and OLAP. You should pretty much never use Vertica nor column stores for OLTP/transactional use cases.

It's not about the columns being wide, per se that is important in Vertica - although that is a benefit. It is about columnar storage and optimized querying, over row storage and optimized writing.

It would be the other way around - SQL Server or other transactional databases for that part. Actually, these days, SQL Server supports column store indexes on top of transactional tables- which gives you HTAP - hybrid transactional/analytical processing. A few other systems do this (Oracle too).

If you do HTAP, in a way you could even avoid using Vertica for data warehouse/OLAP use cases. Or you can build a separate dimensional data model in SQL Server and keep everything in there.

But if you already have Vertica, which is quite fast and good at OLAP queries on dimensional models, use it for that (enterprise data warehouse), and feed data from transactional systems in.

Thanks! Yeah you are right, Vertica is good at OLAP and DWH. I'm just torn between wide table and dimensional data modelling and it's Vertica's problem. I'm probably attacking the wrong monster here.

I'm wondering if you have any source for OLAP data modelling under big data and columnar database?

IMHO the Wikipedia definition of OLAP [1] covers the general topic quite well. Doing OLAP in the big data world, if you use the sense of OLAP equating to multidimensional analysis of data aggregated into cubes, there are several approaches to it in the open source world, such as Apache Kylin [2].

The basic idea remains the same though - building dimensional data models, and then using OLAP/cube technology with MDX on top.

That said, if I think about it, my experiences with wide columnar databases were with HBase and Cassandra. The main goals when using those systems were usually three things: A) getting many attributes/variables being tracked in a single row as part of a measurement or observation of some event - and needing to write the data ridiculously fast (append only) and report on that data in real time; B) coming up with creative data processing patterns to aggregate data into higher level tabular structures, to make for blazing fast queries on these aggregates, which could be updating in real time, and C) being able to store huge volumes of data on a redundant cluster in an efficient format.

If you look at A, B and C above, these are, in a sense, the same reasons why people wish to implement OLAP cube technology, for the most part - aggregating huge results into aggregated views, lightning fast query responses and easy of navigating - "slicing and dicing" the data. However, OLAP cubes are most commonly associated with batch processing - there are real-time OLAP cube technologies, that basically work by doing change data detection on the sources.

I kind of believe that, you don't need specialized OLAP technology to accomplish the same thing - it's all about pre-aggregated and calculated data, efficient writing/reading and storage. I mean, OLAP technology is there to try to make that easier to do, but you can do the same thing with certain tricks in traditional dimensional models (accumulating snapshots, periodic snapshots, aggregate fact tables). Or in wide columnar databases (storing aggregates in wide tables). It's just maybe easier to use OLAP technology, otherwise, you kind of need to do all the scheduling, processing and aggregating of data yourself - but it's totally doable too and I have seen that work really well.

I personally have never been a fan of OLAP cube technology (MDX), even for how powerful it is, because it was so specialized - but that opinion was formed from a legacy of years of only being able to get these features from commercial vendors.

If you look at the legacy of SQL as a language, I actually believe that there was an attempt to unify OLAP queries with SQL by implementing "CUBE" and "ROLLUP" as part of ANSI SQL standard, but it was never as powerful as MDX and the OLAP engines that were built.

I think today, the world has changed, as we have open source OLAP and MDX options available - Pentaho Mondrian was popular for a while, not sure how much today. The current hot thing seems to be Apache Kylin. I also played with Apache Druid, in the past, although it doesn't seem as popular any more, as far as I can tell. Yandex Clickhouse is very popular also these days, for being extremely fast and open source. Again, Wikipedia has a good list of OLAP systems [3].

More to the point, if you use an event-based streaming approach (things like Kafka) - where the event comes directly from the source, combined with writing the data efficiently to something like an append-only, wide, column data store.. well, you have something pretty powerful. The only thing that a solution like this lacks, sometimes, is a powerful query engine - for example, Cassandra or HBase cannot do joins. There are ways to simulate joins with Apache Hive or other technologies, sitting on top of Cassandra or HBase.. but not sure I ever saw those work well.

I mean, the goal of aggregation is to avoid needing joins.. but, you'd be surprised how once in a while, you really wish you could just join this with that to come up with a new result - even in things like Cassandra or HBase. If you do your design right, you can avoid it.. but I find myself wanting it sometimes.

Really, in OLAP, everything is built to avoid JOINs or minimize them as much as possible. It's one reason, I believe though, that having JOINs in a dimensional model is still helpful. But maybe that's the old geezer who still likes SQL in me talking :)

Lastly, I thought I might mention - anyone who says that leaving all data without any kind of model structure or architecture (ex. flat or raw data only) is in a world of hurt, once you get to scale. This is what creates data swamps instead of data lakes. Data models are still as important as ever - snowflake, dimensional, multidimensional, etc. Pure raw data can be very useful for exploratory analysis, even machine learning models, but easy analysis for end consumers always requires some form of data modelling and architecture.

There is an opinion that we should throw data modelling to the wayside, and just feed raw data into AI and machine learning models - but I personally have never seen this as a reality yet.

I hope all the above might be useful!

[1] https://en.wikipedia.org/wiki/Online_analytical_processing

[2] http://kylin.apache.org/

[3] https://en.wikipedia.org/wiki/Comparison_of_OLAP_servers

JSONB objects with SQL relations in Postgresql is my happy-medium between the joy of schema-less JSON and the reassurance of SQL relations.
Would you explain a little more? I'm intrigued. Are you saying SQL relations between fields inside JSONB columns between tables? A field in the JSONB column has foreign key to another table? And you can do a join?
You can do both. You can make relations between jsonb fields.

Or, as I do: have a jsonb blob in one column, and a normal sql field in another, and do the sql joins with that.

So I have some data fields as sql columns, and some as jsonb properties.

Most relational databases aren't relational enough

For example in Drupal you can have a node table which is a foreign key relationship to many other custom tables

In SQL I can't say get me all the nodes that have a start date without explicitly left joining to a potentially dynamic number of other tables using a higher level language to modify the query

In Datomic or Datascript or Datahike or Datalevin or Crux this query would be trivial these are not nosql databases they're EAVT datalog engines, that in some cases also support SQL for compatibility

You're judging all relational databases having looked at Drupal's approach? Ok...
This is useful for most small apps or truly monolithic apps. ...But normalization in a distributed environment with 50+ apps? Really? You need to have some sort of duplication unless you want big bottlenecks, performance penalties, and hot headaches

From the linked MS article: "Redundant data wastes disk space and creates maintenance problems"

Made me laugh well

> You need to have some sort of duplication unless you want big bottlenecks, performance penalties

I think that's called a cache. You can still have solid database design at the core. Then add redis on top of it.

It can be non cached data as well. Imagine a data that needs to be sorted by data that's owned by another another service. You need to replicate the data as you can't do a join+sort between the data coming from two apis effectively. You don't treat that data as a source of truth, but do use it some UI purposes.
Duplication is OK provided there are strict, documented rules about which copy is the "point of truth". If the duplicates diverge you need to know how to fix them.
This seems contrary to what I have learned in my career as an application developer on data heavy platforms. Namely the first section that concludes: > Having stressed the importance of good database design...

I'm not in agreement with the author's concept of good design. I don't want other "doors" to edit the data that bypass the application logic. That's the mess SQL enables for DBAs and scripts that think it's okay to change data without executing business logic.

I think it's better to accept your data and application layers are coupled and plan to evolve and refactor them together rather than teach your database to defend itself from the evil business logic in applications.

I also don't think this dictates relational DBs vs nosql. Your data model is probably relational but how you choose to store and access the data depends more on what use cases you are trying to enable.

When you see that SQL access can ruin data and make them invalid, it's usually a strong indicator of an inadequate DB design. (Maybe you did not have a chance to see a good one?)
The pure, normalized models weren't performant enough so you are correct - the design wasn't all roses in the SQL systems. We had largely flattened down some table groups to be able to get data fast enough. But when you hit that point you may have outgrown RDBMS

I'm not saying you can't get one to run fast enough but the ROI starts to decline significantly.

When creating a visualization for your database design DO NOT under any circumstance cross relations as they go from one table to another. The client will get confused and frustrated. Looking at this article's visualizations hurt my eyes. Very bad choice for what's suppose to be a "good" design.
Thanks for feedback. I just included it as a picture, but you're right.
> we should keep numeric data in “integer”, “numeric” fields

I end up keeping numeric data as text when I'm ingesting an external data source that I don't trust not to change ID format on me. They're all numeric now, but the format could change, and the actual numeric value of the ID is not important at all.

Yes, be careful conflating numeric data with data that looks numerical. IDs fall into the latter for sure. Unless you have complete control over it, then it might always be numeric.

E.g. back when I thought I was super smart, on one project I made the credit card cvv a number. Except they can start with 0. Whoops.

Same here, but with US postal codes. Having grown up on the west coast I didn't realize some started with a zero until embarrassingly late in life.
The best rule of thumb I've seen for this is that unless incrementing a value makes sense, probably use a text field unless you have some other really good reason. So: auto-incremented IDs yes, arbitrary IDs that you don't control no, phone numbers no, SSNs/PINs/ZIP codes, also no.
Nice link. Nothing controversial, but sometimes simplicity is controversial in our field.

I've slowly come around to seeing proper database design as the most essential foundation of an IT system. I remember reading "your data will outlast your application", and I've been around as a developer long enough to have lived it.

One big anti-pattern I've seen with ORMs is that developers who don't really think in terms of data and relationships use the ORM as a kind of object serialization usable only from the application. Rather than thinking of the database as something useful that could be queried and accessed outside the context of the application, they write objects out to various tables and then re-import and re-construct them once they're needed again in the app, often with dependencies that are in code or even yaml configuration files. The upshot is that you simply can't use the database as anything other than a persistence tier for an app. It really would be no different if they had simply given the objects and id and pickled them to disk. The resulting data store really is that inaccessible and meaningless outside the context of the app.

As a result, if an analyst wants a report, they can't write SQL to get those reports, even though the persistence tier is, if perhaps in name only, a "relational database". And when the app goes away (as it inevitably does), they'll still want to know, say, how many beakers and test tubes were ordered by a lab tech who participated in 100 or more experiments per year with at least one faculty member from radiology between June and December in 2007.

But because they don't really know SQL, they see no value to it, and they're honestly just kinda irritated that they don't have an object database, which is what they understand a database to be - a way to pickle and reimport objects. The application outlasts the developer, and the data outlasts the application.

Yeah, if you're writing an app that will hold potentially useful data, definitely think about how this will be accessed outside the context of the app that perhaps inspired its creation, and think about how you'd get at the data if the application went away. That'll probably lead you to old, good database design.

Now, I actually do think that you can create a decent relational database through an ORM. I saw this go sideways with Rails a bunch of times, but I think that's probably because Rails made it so easy to start developing that a lot of people new to application development skipped the design stage of the data backend. You can construct a pretty robust database with migrations rather than CREATE TABLE statements, as long as you stick to the basic principles - but the technology makes it relatively easy to do the opposite and never really get into the mindset of data.

> Nothing controversial

ahem

> Foreign Key constraint is the king of the relational database design

Amazon does not use FK constraints and I have rarely run into systems that do (since 1996ish). Most people with big enough datasets learn not to use them. The overhead for orphaned data is far less than the consequences of using them.

Ah. Fair enough, yes, I agree. That statement goes a little too far, and is a bit at odds with the otherwise overall light tone of the article and the YMMV-ish disclaimer at the end.
Can you clarify about the overhead you're speaking of?

I assume it only comes into play at super massive scale like Amazon-level datasets.

Not even super massive like amazon, even an app for a few million users you’ll run into the performance problems of FKs. They are really overrated in their usefulness - as parent says, orphan records are really not a big deal. As soon as you get to any moderate scale, dealing with a small level of data inconsistencies is inevitable. Especially as you grow into a multiple services, multiple databases type architecture where you fundamentally have to handle breaking fks btw systems
When not using FK constraints, data may be modified or lost (deleted) without proper cleanup of related records leading to "orphaned" records without relationships (or related ids that do not correspond to existing records). The disk space for these orphaned records and the time to run a relational integrity checks across tables (removing orphans), is of minor concern and effort, comparative to a troublesome database entry with FK constraints.
> One big anti-pattern I've seen with ORMs is that developers who don't really think in terms of data and relationships use the ORM as a kind of object serialization usable only from the application. Rather than thinking of the database as something useful that could be queried and accessed outside the context of the application, they write objects out to various tables and then re-import and re-construct them once they're needed again in the app, often with dependencies that are in code or even yaml configuration files. The upshot is that you simply can't use the database as anything other than a persistence tier for an app. It really would be no different if they had simply given the objects and id and pickled them to disk. The resulting data store really is that inaccessible and meaningless outside the context of the app.

> As a result, if an analyst wants a report, they can't write SQL to get those reports, even though the persistence tier is, if perhaps in name only, a "relational database". And when the app goes away (as it inevitably does), they'll still want to know, say, how many beakers and test tubes were ordered by a lab tech who participated in 100 or more experiments per year with at least one faculty member from radiology between June and December in 2007.

> But because they don't really know SQL, they see no value to it, and they're honestly just kinda irritated that they don't have an object database, which is what they understand a database to be - a way to pickle and reimport objects. The application outlasts the developer, and the data outlasts the application.

I'm on the developer's side here. Databases are too big and complex to be used by more than one application; if you have a database then it's really important to have a single owner for that database, or you'll never be able to evolve the schema. If you let an analyst write their own queries against your database, you have no control over what queries are running from where or for whom, so you can't so much as rename a column. And sooner or later your production system will lock up because an analyst wrote a query that they didn't realise had too many joins in.

If it's important to have an archival record of your data, put that in your design constraints and build it into the system. Even then, I probably wouldn't pick an SQL database as the system for doing it with.

Even if you have an ETL pipeline to an OLAP database/data warehouse/etc, if your core database design is hostile to the analytics/etc then it's going to be a pain no matter how carefully they use it.

> it's really important to have a single owner for that database, or you'll never be able to evolve the schema...

IMO, the "owning" application/developers reserve the right to evolve the schema-and if that temporarily breaks ETL, then so be it, but the underlying schema itself shouldn't be hostile to analytics/etc.

> Even if you have an ETL pipeline to an OLAP database/data warehouse/etc, if your core database design is hostile to the analytics/etc then it's going to be a pain no matter how carefully they use it.

Disagree. You don't need a single "core database design". It's fine to have different representations of your data for different purposes, and a transformation pipeline between them; that's the whole idea of CQRS etc.

Yes I'm not disagreeing there, I'm all for pipelines and CQRS and dedicated databases for dedicated purposes. The point I'm making is that if the original schema is a pain to work with, you can have as many pipelines and databases as you want, getting the actual data you want isn't any less of a pain.
> if the original schema is a pain to work with, you can have as many pipelines and databases as you want, getting the actual data you want isn't any less of a pain.

I don't think that's really true. If the original schema is just something you're ingesting before transforming then it doesn't really matter how bad it is; all you're gonna be doing is scanning over all the tables one way or another.

Hi, my apologies if it's a bit off topic, but I wonder if you have any advise or can point me in some direction on the way to becoming a DBA, or something similar enough? I've been working as and old school sysadmin for the last 17 years and counting. Even if I can do nowadays devops stuff, I always liked the DBA work, based on my experience as sysadmin for DB servers (mysql, oracle, and friends...), and I'm considering moving my career in that direction. So is there any "DBA certification" for which I could take a course and all that? Thanks!!
You've hit on the major problem with DBA work, which is that you are touching some of the most expensive prod stuff and even if you have "that cert" many places just want to see that somewhere else trusted you as a DBA for that type of product.

I managed to transition in a company in dire straits that had no other options, and then getting new DBA jobs was fairly easy.

Thanks for your feedback, and I pretty understand what you mean, I've worked with some DBA know as the "million dollars error guy". But I think that's a second step, and anyway we were maintaining the servers where those dbs were running... So we deserve some trust maybe?
Which software is used to make those diagrams? The ones I use don't have good arrows like that
Is the diagram a joke? That's total ERD spaghetti.
It's included only as a picture :)