Hacker News new | ask | show | jobs
by throwaway894345 2102 days ago
If one of the purposes of relational databases is data modeling, I've always wondered why there aren't good semantics for sum types. The real world is full of them, but databases can't express them. When I bring this up, some people respond that this is the purpose of ORMs; however, this implies that we have an arbitrary bifurcation in which some of the processing happens efficiently in SQL and anything that depends on sum types has to get hoisted up and over a network to application code. Further, it allows for different clients to behave differently, possibly having different notions of what any given sum type's variants are (which leads invariably to data corruption). I really wish databases did better here, but maybe I'm missing something.
9 comments

There's no real reason you can't have sum types in a relational database, this just goes back to our current crop of RDBMSs being quite old and predating the current emphasis on strong typing. I suspect a modern relational database with an easier way define complex types (not writing a plugin) would be very popular, but the amount of work required to make something like that is immense
That matches my intuition. The more surprising thing is that this doesn't seem to be talked about very much. I've literally never heard anyone else advocate for a relational database with support for sum types (I'm sure they exist) which has caused a lot of introspection about whether or not there's something about RDBMSs that I just don't understand.
it came and went, postgres and sql server were (and still are) referred to as "object-relational" because of their support for rich user-defined complex types. it's just not that practical real-world because we tend to start with first normal form which complex types violate naturally.

> whether or not there's something about RDBMSs that I just don't understand

perhaps if you consider the entire result of some query to imply the type of the thing being modeled rather than any given table or cell, then it's more obvious that the basic types supported aren't much of a limitation? afterall you can always include a "type" indicator (or project one from a more sophisticated relational model).

Normal form supports complex product types just fine. A row is an object and a row containing a reference to a row in another table is equivalent to an object containing another object, and of course a result set is a list.

But I’m not talking about product types, complex or otherwise. I’m talking about sum types (also known as Algebraic Data Types).

> A row is an object and a row containing a reference to a row in another table is equivalent to an object containing another object

if you're already thinking of it that way then it's a small jump to the result set as a set, that can participate in a union - and there's your sum type. add or project a type indicator and it's literally a tagged union.

That's not type safe.
Three techniques for sum types in SQL (absorption, separation, and partition):

https://www.parsonsmatt.org/2019/03/19/sum_types_in_sql.html

Also, here is a relational database system with native support for sum types (and also no "NULL" nonsense, which is also not part of pure relational algebra):

https://github.com/agentm/project-m36

Oh, very cool. I'll check those out. Thanks for sharing!
I tend to agree. Algebraic data types (product and sum types) is essential for modeling. Missing sum type would be incomplete in terms of modeling.

A lot of obscure OO code I have seen is related to modeling sum types of products (or parameterized enumerated in some language), or the lack of it in mainstream languages.

The problem applies to SQL as well. If I make everything into the same table and leave many columns empty I’m not modeling at all. But if I split them into tables it defeat most basic SQL functionalities like SELECT * FROM my_sum_type.

The traditional relational model is very focused on mutable data and normalization. Different types would be categorized in separate columns. So this idea would run counter to "best practice" and need something foundational behind it, which would be just enough out of scope for a traditionally typed relational datastore. Maybe this is just another way of saying the underlying theories are different, or covering different areas of computation and storage. So the prevailing wisdom is to not expose type complexity explicitly to clients, but just export the inherent problem implicitly.
It seems that the relational model plainly enough wants to be the gate keeper for your data model—it gives extensive tools for modeling and enforcing data schema, but it just kind of throws its arms up at data that is “OR” shaped. Some people argue that it’s because there’s not an obvious way to lay out sum type data in memory or on disk or on the wire, but these problems are all solved by traditional programming languages (data is data, at the end of the day). If you want to take the “different philosophies” tack, then it seems like a philosophy that only addresses AND-shaped data leaves a lot to be desired.
I'm interested in understanding what you mean. What is "OR" shaped data? Are you thinking of data like, "The staff member must have either a salary or an hourly rate"?

Typically I would see this modelled with two db columns, with a DB constraint indicating that only one of these can have a value.

That’s generally what I’m talking about, and it works well enough for simple cases, but these sorts of solutions tend not to scale well and you give up type safety. Not the end of the world, just disappointing.
Sum types are harder to express as data integrity constraints that product types - for example, Horn clause constraints can be used to express that a table is a product (or limit, more generally) of other tables, but to express that a table is a co-product (or colimit, more generally) requires a stronger logic (handling disjunction, for example). This has implications for data processing algorithms such as the chase that depend on the strength underling logic.
Can you give an example of real world data modeling where you want more expressive sum types over just using enums? Enums are technically a subclass of sum types, but even those are non-trivial to use at a data format level (Try evolving them in an on-the-wire message format like Avro or Protobuf).
Imagine a system that allows third-party login (Facebook / AppleID / whatever) - then the account has either a username/password hash or an oauth token or some other kind of structured data.

Delivery addresses for a system that supports both physical and digital products - you want a type-level distinction between physical and digital addresses, but an order might be being shipped to either.

Subscription vs free trial - they're different kinds of thing, but you want to store more details (e.g. expiry date) than just an enum of one or the other.

>Imagine a system that allows third-party login (Facebook / AppleID / whatever) - then the account has either a username/password hash or an oauth token or some other kind of structured data.

What I've seen most often is you have to deal with account merging but lets say you do really want either/or...

Wouldn't you just have your third party tables (each with their own idiosyncrasies) and in your user table you'd have login_type and login_id columns? You know which table to hit by type using the id?

> Wouldn't you just have your third party tables (each with their own idiosyncrasies) and in your user table you'd have login_type and login_id columns? You know which table to hit by type using the id?

You can do that but it's a bodge. E.g. you won't be able to have the foreign key constraints you'd normally have on that login_id column. And good luck writing a query that actually does something differently for each case - you'd have to do something like multiple left joins and there's no way to check you've handled all the cases and not done one of them twice.

You can have a table or view of all the ids across your implementations and FK into that.

As for how you would model it in the application, in this case you can just normalize across all the possible columns and have the orm build out your mapped object.

> You can have a table or view of all the ids across your implementations and FK into that.

That's just moving the problem around. That table would have to have a bunch of nullable columns, and there's no way to express the constraint that either these columns are non-null or those columns are non-null, or these columns are non-null when this enum has this value.

I'm guessing you've never seriously used a language with first-class sum types. Yes, you can use hacks to represent sum-typed data in languages that don't have proper sum types, but it's always going to be a hack. It's like saying C has OO support because you can always construct virtual function tables by hand.

How do you model "postal address"? Some postal addresses are PO Boxes, some are street addresses, etc. There are canonical representations of these different cases. Do we just shove it all in a string, and let the application perform domain validation?
Don't even try to do validation on postal addresses. The postal system has so many corner cases that you'll never be able to correctly handle all of them. Every mishandled corner case will cost you, or your counter-party, time and money.

Just dump addresses into a unicode string and let the postal system figure it out.

postal address is one of those cases where you probably do just want to shove it all in a string as most structural constraints eventually backfire - especially if you support international: http://www.columbia.edu/~fdc/postal/

the most common schema I've seen is usually something like line1, line2, line3, city, state, country, zip, etc. if it's a reporting database then city/state/country/zip is often mashed into some sort of location id.

Watch out for `state` as well, please don't make it mandatory like so many websites. No, a region is not the equivalent of a state in France, you don't need it for my package to get there!
Each type of postal address is a separate column. New postal address "types" would get new columns. This works particularly well when addresses can have both PO boxes as well as street addresses. This is actually more flexible than tagged unions/sum types, at least for this particular case.
So a PO Box address or a house address would be concatenated into a string value and then stored in either Addresses.POBoxAdress or Addresses.House? It’s still not structured. How can someone easily get the postal-code/zip-code?
I think they may mean that the result set has elements of different types. For instance if you stored restaurants by genre but wanted a list of all restaurants, but retaining all of the unique fields, you currently need to generate the product type of the genres.
Hmm, is it that hard?

    Vehicle table -- ID, TypeId, Make, etc.  (123, 456, ...)
    TypeId table -- ID, Type (456, motorcycle)
    Motorcycle table -- ID, HandleBarStyle, etc. (456, Low Rider, ...)
    Automobile table -- ID, TrunkSpace, etc. (789, ...)
You can pretty easily add extra information to any id as long as you know where to look, and that can be a simple enum column to define the concrete type (and thus what data to grab). Its an easy enough join, isn't it?

The data modelling isn't the hard part really. Pulling it into an application in a nice way is probably harder. I don't often use languages with sum types professionally so maybe this is way off base but I don't see an issue.

How does the vehicle table ID column enforce referential integrity with the primary keys of the other columns? In general, there are a lot of issues that come up with respect to type safety and query semantics when you use these workarounds. They’re okay, but disappointing.
You can enforce that the column is non-null and has a valid value id in the relation table. You can also enforce that your motorcycle and auto tables are referenced in the relation table or make it a view of all the ids across the types (conceptually anyway, I'd have to look into the perf). It's a contrived example but what is the issue? What are the drawbacks? These are pretty easy constraints to add.
If you drop a row from the motorcycle table without CASCADE does it error if the vehicles table still contains a pointer? If you drop it with CASCADE does it also drop it from the vehicles table?

Honestly it’s been several years since I’ve done standard app development, so I’ve forgotten some of the specific problems that I would run into. I just remember that I tried all of the gimmicks to emulate sum types and they were always clumsy. Not a very satisfying rebuttal, I know.

What's difficult about them? I typically use nullable columns and then a check constraint to specify a custom condition for nullability. Columns belonging to the same alternative in the sum type must be all null or all not null. And then there's check only one active alternative.
This approach doesn’t scale when business requirements change regularly such that you need to add or remove columns to an existing table.

Adding new columns by creating a new table is easy and cheap and doesn’t involve downtime. Adding 100+ columns to an existing table because the spec said a relationship went from 1:0-1 to 1:1 is a pain.

(This can be avoided with creative design with deferrable constraints, something certain major RDBMS are still lacking, gah!)

> Adding 100+ columns to an existing table

Whaaaa 100+ columns? How often do you need to add 100 columns to an existing table?

I mean honestly that has to be as rare as chicken teeth.

>>Whaaaa 100+ columns? How often do you need to add 100 columns to an existing table? >>I mean honestly that has to be as rare as chicken teeth.

Haha, and yet it happens. One of the worst databases I ever dealt with was some ridiculous variant of Universal Model for a SCADA application, the databases were named D1, D2, D3, and so on; the tables T1, T2, T3, etc; the columns C1, C2, C3, you get the picture I hope...when they hit maximum column count they would create a new table and bizarrely rebalance renumbered columns between them. In database D0 there was a catalog with all of the details relating back to the actual model. The system was actually implemented at some defense contractors and power system operators, I remember someone at Lockheed whining about hitting the number of databases limit for Sybase.

Hopefully nobody has to deal with that crap any more.

Oof I feel your pain
Whoah I've never dealt with tables with more than 20 columns. Can't imagine the shops that require 100+ columns.
Type safety And query semantics, mostly. You can use constraints to get some of this back, but it only goes so far. Ultimately there’s a reason statically typed programming languages developed sum types, and all of those reasons apply to databases as well because data is data.
you can avoid nullable cols by using separate physical tables for each concrete type with surrogate PKs from a common sequence. these can be concatenated back together in a view (with no performance hit if you're careful not to hide indexes and predicate pushdowns). FKs then can reference the appropriate concrete table and in case you need an FK on some union you can use an indexed view instead.

I've used this approach for modeling entities that are polymorphic in object-land and haven't felt particularly underequipped - more robust indexed view support would be nice but that's active research territory (one of the hardest problems in computer science, materialize.io looks promising!)

> If one of the purposes of relational databases is data modeling

Huh?

Creating a data model before creating a database, is like writing an outline before writing an essay. It organizes your thoughts and gives structure to what you are about to do. Once you have a data model, you can then implement it using whatever database technology you choose.

If you don't start with a data model, you literally don't know what you are doing.

I think you misunderstood my comment. I’m arguing that relational databases would be more useful if they had sum-type semantics. Data models often have OR-shaped data, and pretending like this class of data doesn’t exist and making it the purview off application code makes relational databases much less useful than they might otherwise be.