| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by fusiongyro 5090 days ago

I think you're getting hung up on the low-level details and missing the big relational picture.

Timestamps are atomic units. If I add 60 minutes to 10:15 AM I don't want to see 10:75 AM, I want to see the 11:15. This is because the "components" of the timestamp are not independent values. Likewise, a string may be an array of characters, but a name is an atomic entity. It's meaningful to break a string into characters (low-level) but it's not meaningful to break a name into name-characters.

Now, if I had a reason to make a relationship between one entity and many characters, the correct way to model it in the relational database is to use a one-to-many with a character column. Using a string instead is not the right thing to do. Strings, for starters, have an ordering, and sets do not. Indexing that string column probably won't perform as well as indexing the separate table, because string indexes care about that order. It may perform worse, or model the situation worse, for a variety of reasons. In practice, I've never seen anyone make a relationship between a character and an entity.

The distinction between when I'd use a separate row for each character and when I'd use a string is exactly the distinction between when using an array is 1NF and when it isn't. I don't think the idea of avoiding repeating groups says anything concrete about data types. I'd be surprised if the theorists spent a great deal of time discussing the philosophical ramifications of strings being composed of characters or what each type means on-disk; at least 40% of every book by Joe Celko consists of admonishment to avoid worrying about the physical layout. I'm sorry if this comment isn't sufficiently formal for you, but it's the best you're going to get from me. Lord knows this won't be the first time database theory and practice have not matched up to everyone's satisfaction. Maybe next we can discuss implementing transitive closures in portable SQL?

Back to arrays, the basic loss of flexibility is portability. If you modeled it traditionally with tables, your result would be portable to any RDBMS, including whatever indexes and constraints you made. By modeling it with arrays, you're limiting yourself to Postgres, and you're probably forcing yourself to use check constraints to ensure uniqueness within each array (if you care). The second loss of flexibility is that generic database access libraries probably don't support arrays natively. Hibernate, for example, does not "do" arrays, so you'll either be bypassing it altogether or converting them to strings and parsing them on the way out. These are the reasons I have off the top of my head; I wouldn't be shocked if there were others.

I don't think I'm going to see a third option besides arrays and normalization, but in this case, it would almost always be fine for the developer to err on the side of normalization. Are there uses for them? Sure. Just not many.