Hacker News new | ask | show | jobs
by fusiongyro 5043 days ago
I assume by complex data type you mean something like (say) complex numbers, IP addresses, etc. The key difference is that the components of a complex data type don't stand on their own, or it's a violation of 1NF. I wouldn't mind storing a complex number in one column instead of two largely because I'd rather math with complex numbers "just work" and I'm unlikely to go updating the database for just the imaginary part. Likewise with IP addresses, it is useful to query them with netmasks and the like, but you're unlikely to query or update some middle component of them.

The interesting thing about arrays is depending on the work you're doing with them, they could either be a violation of 1NF or not. For example, if you're storing vectors in the physics sense, you may not be querying or updating their contents, in which case it's not a violation of 1NF because it's an atomic value. The danger of encouraging people to use Postgres's arrays is a fair number of the people reading are developers and will take this as carte blanch to store repeated values or sets as arrays, and miss out on a bunch of the power and elegance of relational databases (and probably cause themselves a lot of pain dealing with a type that has no direct and helpful JDBC mapping, for example).

> if for no other reason than it could save a lot of space

I discourage people from worrying too much up-front about the space utilization of their database design for several reasons. Intuitions about database storage and performance tend to be incorrect. Indexes usually matter a lot more in terms of disk use than the exact types of the columns. It's also a bad idea to spend a lot of energy optimizing the database before you have a significant amount of real-world data. Query performance alone varies dramatically between fake and real-world data sets, in part because the query planner will choose different strategies based on the sizes of tables and indexes and also because you usually don't know ahead of time what your access patterns will look like.

When it comes to databases, there really aren't very many hard and fast rules compared to rules of thumb. The rule of thumb on arrays is don't use them, and there are good reasons for that. It could be that you see massive performance and disk savings with tags and the loss of relational correctness doesn't bite you in a particular application, but you should investigate before choosing. All optimizations are trading flexibility for performance. You should make sure you don't need the flexibility and you do need the performance before making the trade.

1 comments

"The key difference is that the components of a complex data type don't stand on their own, or it's a violation of 1NF."

The characters or substrings that make up a string stand on their own. In fact, it's pretty hard to differentiate a string from an array of characters.

Similarly, the date component of a timestamp stands on its own, and timestamps are routinely broken up in that fashion.

"The interesting thing about arrays is depending on the work you're doing with them, they could either be a violation of 1NF or not."

In other words, you're saying that queries are a violation of 1NF, rather than the schema?

What about integers and modulo division? Is modulo division considered to be accessing a component of an integer?

"people reading are developers and will take this as carte blanch to store repeated values or sets as arrays"

There's no substitute for careful thought when designing a schema.

"Intuitions about database storage and performance tend to be incorrect."

Good point.

"loss of relational correctness"

I still don't think you've justified this. Remember that the normal forms are formally defined, so you should be able to show a clear distinction, which I've not seen yet.

"All optimizations are trading flexibility for performance."

What particular kind of flexibility is lost in this example (product tags)?

"The rule of thumb on arrays is don't use them, and there are good reasons for that."

This really sums up the situation. It's easy to use arrays for the wrong things, so the rule of thumb is to never use them. But they are still useful often enough for developers to get frustrated, and the rule of thumb may cause them to invent a worse solution elsewhere.

We should make the good reasons known, rather than just the rule of thumb.

I think you're getting hung up on the low-level details and missing the big relational picture.

Timestamps are atomic units. If I add 60 minutes to 10:15 AM I don't want to see 10:75 AM, I want to see the 11:15. This is because the "components" of the timestamp are not independent values. Likewise, a string may be an array of characters, but a name is an atomic entity. It's meaningful to break a string into characters (low-level) but it's not meaningful to break a name into name-characters.

Now, if I had a reason to make a relationship between one entity and many characters, the correct way to model it in the relational database is to use a one-to-many with a character column. Using a string instead is not the right thing to do. Strings, for starters, have an ordering, and sets do not. Indexing that string column probably won't perform as well as indexing the separate table, because string indexes care about that order. It may perform worse, or model the situation worse, for a variety of reasons. In practice, I've never seen anyone make a relationship between a character and an entity.

The distinction between when I'd use a separate row for each character and when I'd use a string is exactly the distinction between when using an array is 1NF and when it isn't. I don't think the idea of avoiding repeating groups says anything concrete about data types. I'd be surprised if the theorists spent a great deal of time discussing the philosophical ramifications of strings being composed of characters or what each type means on-disk; at least 40% of every book by Joe Celko consists of admonishment to avoid worrying about the physical layout. I'm sorry if this comment isn't sufficiently formal for you, but it's the best you're going to get from me. Lord knows this won't be the first time database theory and practice have not matched up to everyone's satisfaction. Maybe next we can discuss implementing transitive closures in portable SQL?

Back to arrays, the basic loss of flexibility is portability. If you modeled it traditionally with tables, your result would be portable to any RDBMS, including whatever indexes and constraints you made. By modeling it with arrays, you're limiting yourself to Postgres, and you're probably forcing yourself to use check constraints to ensure uniqueness within each array (if you care). The second loss of flexibility is that generic database access libraries probably don't support arrays natively. Hibernate, for example, does not "do" arrays, so you'll either be bypassing it altogether or converting them to strings and parsing them on the way out. These are the reasons I have off the top of my head; I wouldn't be shocked if there were others.

I don't think I'm going to see a third option besides arrays and normalization, but in this case, it would almost always be fine for the developer to err on the side of normalization. Are there uses for them? Sure. Just not many.

"The characters or substrings that make up a string stand on their own. In fact, it's pretty hard to differentiate a string from an array of characters."

It's difficult to differentiate a number from an ordered n-tuple of bits, but that doesn't make storing the number 123 storing an array.

I took the parent post's point to be clear: it isn't whether you can look at something as, in some sense, an array, it's whether you treat it as an atomic value. That's generally what you do with numbers: you update the whole number, not the third bit. And it's likely what you do with strings---even if you just want to change one character, you update the whole string, not the third character.

Likewise, if you want to store a complex number as a 2-tuple, you could treat that as an atomic data type even though it consists of two elements by always operating on the pair*. That's the respect in which it "stands on its own"---you don't peer inside.