> And no, using arrays is not an automatic violation of first normal form.
One of the conditions of the 1NF is that each row and column contain one and only one value. While a NULL might be disputable, using multiple values as shown in the two examples (tags, items/price/quantity) are a clear violation of the 1NF. Only if the array data were truly "one value" (e.g., a vector) it would not be a violation. (I.e., I am not trying to contradict your statement, but I think it helps to add a bit of clarification.)
If you never need to query 'inside' the array, then I don't think that's a violation of 1NF at all. If you treat the entire array as an atomic value, then I would think that's still 1NF.
Arguably, yes, because you haven't really broken your data down as much as you need to; but this is a fairly contentious issue - just see http://en.wikipedia.org/wiki/First_normal_form#Atomicity. If you are frequently needing to query inside data, you may well gain from moving that 'prefix' into a separate column, or more drastically rethinking your data model if that's appropriate.
"One of the conditions of the 1NF is that each row and column contain one and only one value."
There is no useful differentiation between atomic and non-atomic values. A timestamp is made up of a date and a time, a string is an array of characters, a real number has a sign and magnitude, etc.
I suppose you could just use boolean only, which could be regarded as "atomic".
Arrays aren't really in the spirit of relational design. This means that using them will change the ACID characteristics compared to a relational design. In the article's first example, it would seem like adding another item to a purchase would cause the entire purchase row to lock - this wouldn't happen if items were stored in their own table.
That's not to say they aren't useful though; read performance would be much better with all the data living in a single row somewhere on disk. SQL queries would also be easier to write.
It's a slightly contrived example, because in a real-world schema you would want enough information about each line item that an array isn't practical. But I don't think locks are an issue here, because no sane business process would need different transactions that were updating the same purchase's line items in conflicting ways.
"Arrays aren't really in the spirit of relational design."
Why do you say that? To a certain extent I agree in broad terms; but I don't believe that it is somehow "unclean" or un-relational just because you are using an array (or any other complex data type).
"This means that using them will change the ACID characteristics compared to a relational design."
No, arrays are protected by ACID as is any other data in postgres.
"it would seem like adding another item to a purchase would cause the entire purchase row to lock"
It will still allow concurrent reads of the row in postgres (MVCC). In order for the write lock to be a practical problem, there would have to be a lot of concurrent updates to the very same purchase.
I think we can all agree that the first example is "hacky" (which the author says in the article), so let's ignore that one. How about the second example? I think it's pretty reasonable to store tags that way, if for no other reason than it could save a lot of space.
I assume by complex data type you mean something like (say) complex numbers, IP addresses, etc. The key difference is that the components of a complex data type don't stand on their own, or it's a violation of 1NF. I wouldn't mind storing a complex number in one column instead of two largely because I'd rather math with complex numbers "just work" and I'm unlikely to go updating the database for just the imaginary part. Likewise with IP addresses, it is useful to query them with netmasks and the like, but you're unlikely to query or update some middle component of them.
The interesting thing about arrays is depending on the work you're doing with them, they could either be a violation of 1NF or not. For example, if you're storing vectors in the physics sense, you may not be querying or updating their contents, in which case it's not a violation of 1NF because it's an atomic value. The danger of encouraging people to use Postgres's arrays is a fair number of the people reading are developers and will take this as carte blanch to store repeated values or sets as arrays, and miss out on a bunch of the power and elegance of relational databases (and probably cause themselves a lot of pain dealing with a type that has no direct and helpful JDBC mapping, for example).
> if for no other reason than it could save a lot of space
I discourage people from worrying too much up-front about the space utilization of their database design for several reasons. Intuitions about database storage and performance tend to be incorrect. Indexes usually matter a lot more in terms of disk use than the exact types of the columns. It's also a bad idea to spend a lot of energy optimizing the database before you have a significant amount of real-world data. Query performance alone varies dramatically between fake and real-world data sets, in part because the query planner will choose different strategies based on the sizes of tables and indexes and also because you usually don't know ahead of time what your access patterns will look like.
When it comes to databases, there really aren't very many hard and fast rules compared to rules of thumb. The rule of thumb on arrays is don't use them, and there are good reasons for that. It could be that you see massive performance and disk savings with tags and the loss of relational correctness doesn't bite you in a particular application, but you should investigate before choosing. All optimizations are trading flexibility for performance. You should make sure you don't need the flexibility and you do need the performance before making the trade.
"The key difference is that the components of a complex data type don't stand on their own, or it's a violation of 1NF."
The characters or substrings that make up a string stand on their own. In fact, it's pretty hard to differentiate a string from an array of characters.
Similarly, the date component of a timestamp stands on its own, and timestamps are routinely broken up in that fashion.
"The interesting thing about arrays is depending on the work you're doing with them, they could either be a violation of 1NF or not."
In other words, you're saying that queries are a violation of 1NF, rather than the schema?
What about integers and modulo division? Is modulo division considered to be accessing a component of an integer?
"people reading are developers and will take this as carte blanch to store repeated values or sets as arrays"
There's no substitute for careful thought when designing a schema.
"Intuitions about database storage and performance tend to be incorrect."
Good point.
"loss of relational correctness"
I still don't think you've justified this. Remember that the normal forms are formally defined, so you should be able to show a clear distinction, which I've not seen yet.
"All optimizations are trading flexibility for performance."
What particular kind of flexibility is lost in this example (product tags)?
"The rule of thumb on arrays is don't use them, and there are good reasons for that."
This really sums up the situation. It's easy to use arrays for the wrong things, so the rule of thumb is to never use them. But they are still useful often enough for developers to get frustrated, and the rule of thumb may cause them to invent a worse solution elsewhere.
We should make the good reasons known, rather than just the rule of thumb.
I think you're getting hung up on the low-level details and missing the big relational picture.
Timestamps are atomic units. If I add 60 minutes to 10:15 AM I don't want to see 10:75 AM, I want to see the 11:15. This is because the "components" of the timestamp are not independent values. Likewise, a string may be an array of characters, but a name is an atomic entity. It's meaningful to break a string into characters (low-level) but it's not meaningful to break a name into name-characters.
Now, if I had a reason to make a relationship between one entity and many characters, the correct way to model it in the relational database is to use a one-to-many with a character column. Using a string instead is not the right thing to do. Strings, for starters, have an ordering, and sets do not. Indexing that string column probably won't perform as well as indexing the separate table, because string indexes care about that order. It may perform worse, or model the situation worse, for a variety of reasons. In practice, I've never seen anyone make a relationship between a character and an entity.
The distinction between when I'd use a separate row for each character and when I'd use a string is exactly the distinction between when using an array is 1NF and when it isn't. I don't think the idea of avoiding repeating groups says anything concrete about data types. I'd be surprised if the theorists spent a great deal of time discussing the philosophical ramifications of strings being composed of characters or what each type means on-disk; at least 40% of every book by Joe Celko consists of admonishment to avoid worrying about the physical layout. I'm sorry if this comment isn't sufficiently formal for you, but it's the best you're going to get from me. Lord knows this won't be the first time database theory and practice have not matched up to everyone's satisfaction. Maybe next we can discuss implementing transitive closures in portable SQL?
Back to arrays, the basic loss of flexibility is portability. If you modeled it traditionally with tables, your result would be portable to any RDBMS, including whatever indexes and constraints you made. By modeling it with arrays, you're limiting yourself to Postgres, and you're probably forcing yourself to use check constraints to ensure uniqueness within each array (if you care). The second loss of flexibility is that generic database access libraries probably don't support arrays natively. Hibernate, for example, does not "do" arrays, so you'll either be bypassing it altogether or converting them to strings and parsing them on the way out. These are the reasons I have off the top of my head; I wouldn't be shocked if there were others.
I don't think I'm going to see a third option besides arrays and normalization, but in this case, it would almost always be fine for the developer to err on the side of normalization. Are there uses for them? Sure. Just not many.
"The characters or substrings that make up a string stand on their own. In fact, it's pretty hard to differentiate a string from an array of characters."
It's difficult to differentiate a number from an ordered n-tuple of bits, but that doesn't make storing the number 123 storing an array.
I took the parent post's point to be clear: it isn't whether you can look at something as, in some sense, an array, it's whether you treat it as an atomic value. That's generally what you do with numbers: you update the whole number, not the third bit. And it's likely what you do with strings---even if you just want to change one character, you update the whole string, not the third character.
Likewise, if you want to store a complex number as a 2-tuple, you could treat that as an atomic data type even though it consists of two elements by always operating on the pair*. That's the respect in which it "stands on its own"---you don't peer inside.
Arrays are actually pretty cool in PostgreSQL and keep getting cooler. Unnest() turns arrays into relations for example.
One of my primary uses for arrays is for passing data to/from the application. Data may not be stored in the db as an array, but it really makes passing complex data structures to/from stored procedures a lot easier.
There is nothing that unique about the inline support of other data structures. Many databases have done it in the past e.g Oracle and do so today especially in NoSQL land e.g. MongoDB, Cassandra.
I think Jeff was principally referring to relational dogma that is so prevalent in other SQL database system culture, even though they might implement such types to get a check-box (or not: MSSQL doesn't support arrays, AFAIK) Clearly nested and record tagged structures are popular in other database systems, but in the universe of SQL systems Postgres has always broken rank with tradition with most of its relational brethren (exception, due to similar parentage: Illustra and then Informix) when it came to data types -- for example, the long-standing inclusion of polygons, lines, and points.
One of the conditions of the 1NF is that each row and column contain one and only one value. While a NULL might be disputable, using multiple values as shown in the two examples (tags, items/price/quantity) are a clear violation of the 1NF. Only if the array data were truly "one value" (e.g., a vector) it would not be a violation. (I.e., I am not trying to contradict your statement, but I think it helps to add a bit of clarification.)