| Ok here we go: Table is a list of rows. To access a row is O(N) assuming perfect B-Tree. To access an element in an array is O(M) therefore to access an element of an array in a row is O(N*M) as you have to materialize the entire page that the array is in from disk. That array could be larger than a page which leads to more problems. GIN is a possibility but the array elements are ordered which is overhead (they are a list, not a relational set) therefore there is a condition there if you want to reference by index which is the primary point of arrays. If not, you should use a set if they aren't ordered which means you should probably use a table and join anyway. Fundamentally membership is a set relation, not a list relation. The order of the tags is not important. Pain = PgSQL documentation 8.14.3 which references indices of arrays directly in a where clause. A string - I wouldn't do string manipulation anyway in a where clause. A date - that's a number which you can apply relops to easily. The optimiser will most likely be better at retrieving and caching data based on a join condition versus column contents. |
I guess that's the thing: why would you want to access array elements? What business problem is that solving?
"there is a condition there if you want to reference by index which is the primary point of arrays"
In this example, he's using an array more like a set. I don't think he intends for a tag at index 17 to mean something different than a tag of the same value at index 3.
I agree that postgresql should probably have a good "set" datatype, but at this time it does not and arrays are a reasonable way to implement sets in a lot of cases.
For instance, GIN allows fast searches on queries like "find the arrays that contain element 'X'". That's more like a set operation.
"The optimiser will most likely be better at retrieving and caching data based on a join condition versus column contents."
There are probably some cases where it's better and some where it's worse. Postgres does keep track of stats for array elements in addition to the arrays themselves, so it's not obvious to me that using a separate table would improve matters.