The more you put in the database black box, the more scalability problems you will encounter in the future. It's more economical to scale out than buy bigger databases servers.
Happily, Postgres keeps improving performance with each release. According to benchmarks [1], the upcoming version 9.2 will scale up to 64 cores for read-heavy workloads. Unless your application is going to grow really fast -- keeping in mind that the hardware capabilities will continue to improve each year -- then it's just premature optimization to worry about horizontal scaling. Just upgrade your database server once a year and reap the benefits of Moore's law.
I love your optimism. The real world doesn't work like that. The real world punishes you for every shitty feature you pick and every bad chunk of code.
Our application is 15 years old, and we're on a 64-core machine with 768Gb of RAM and 35Tb of disk.
We're running at 80% capacity.
Where do we go from here? Yes, we rewrite and scale out for the measly cost of £450k. That cost would have been avoided with the appropriate due diligence. That is not a cost anyone wants to swallow.
Then again I don't work disposable CRUD applications...
"The more you put in the database black box, the more scalability problems you will encounter in the future."
I don't see a obvious relationship between using arrays to store product tags and scalability.
What method of solving this problem are you suggesting, and what are the scalability characteristics of that solution?
Databases don't go out of their way to hurt scalability. Databases are shared state, and scaling that is just hard. Unless the problem is easy, in which case it's easy no matter what you use.
Looking at the way arrays are implemented, they can readily turn operations into O(M*N) rather than O(N). Even the PgSQL documentation suggests cases that will hurt you.
I would suggest that the relational model is maintained as there are no cases in which it isn't valid where arrays are and the optimiser is likely to come up with a better solution.
Relational databases (well all databases) have convenient tools which in the short term look good, but in the long term will hurt you.
Careful testing and feature selection is the alternative I am suggesting (from 20 years of experience with RDBMS platforms).
Which operations, specifically? We have the product tags example to work from, so it would help to be more concrete.
"into O(M*N) rather than O(N)"
I assume here that your target is "scalability to higher numbers of tags for a given product"? If so, please state that explicitly. Again, M and N aren't necessary, because we have an example to work from.
Are the operations still susceptible to worse scalability on this axis if you employ a GIN index?
"PgSQL documentation suggests cases that will hurt you."
Please elaborate.
"I would suggest that the relational model is maintained"
I don't see how your solution is relational and the other one is not. Can you please explain? How is using an array different from using, say, a string, which is essentially an array of characters? How is it different from using a date, the components of which you can access (e.g. grouping by the month)?
"the optimiser is likely to come up with a better solution."
Table is a list of rows. To access a row is O(N) assuming perfect B-Tree. To access an element in an array is O(M) therefore to access an element of an array in a row is O(N*M) as you have to materialize the entire page that the array is in from disk. That array could be larger than a page which leads to more problems.
GIN is a possibility but the array elements are ordered which is overhead (they are a list, not a relational set) therefore there is a condition there if you want to reference by index which is the primary point of arrays. If not, you should use a set if they aren't ordered which means you should probably use a table and join anyway.
Fundamentally membership is a set relation, not a list relation. The order of the tags is not important.
Pain = PgSQL documentation 8.14.3 which references indices of arrays directly in a where clause.
A string - I wouldn't do string manipulation anyway in a where clause.
A date - that's a number which you can apply relops to easily.
The optimiser will most likely be better at retrieving and caching data based on a join condition versus column contents.
I guess that's the thing: why would you want to access array elements? What business problem is that solving?
"there is a condition there if you want to reference by index which is the primary point of arrays"
In this example, he's using an array more like a set. I don't think he intends for a tag at index 17 to mean something different than a tag of the same value at index 3.
I agree that postgresql should probably have a good "set" datatype, but at this time it does not and arrays are a reasonable way to implement sets in a lot of cases.
For instance, GIN allows fast searches on queries like "find the arrays that contain element 'X'". That's more like a set operation.
"The optimiser will most likely be better at retrieving and caching data based on a join condition versus column contents."
There are probably some cases where it's better and some where it's worse. Postgres does keep track of stats for array elements in addition to the arrays themselves, so it's not obvious to me that using a separate table would improve matters.
The general point still stands - scaling up is often a preferred solution and life proves it every day. Not so many companies enjoy facebook style grows, and even aren't viral at all.
[1] http://rhaas.blogspot.com/2012/04/did-i-say-32-cores-how-abo...