Hacker News new | ask | show | jobs
by jeffdavis 5043 days ago
"can readily turn operations"

Which operations, specifically? We have the product tags example to work from, so it would help to be more concrete.

"into O(M*N) rather than O(N)"

I assume here that your target is "scalability to higher numbers of tags for a given product"? If so, please state that explicitly. Again, M and N aren't necessary, because we have an example to work from.

Are the operations still susceptible to worse scalability on this axis if you employ a GIN index?

"PgSQL documentation suggests cases that will hurt you."

Please elaborate.

"I would suggest that the relational model is maintained"

I don't see how your solution is relational and the other one is not. Can you please explain? How is using an array different from using, say, a string, which is essentially an array of characters? How is it different from using a date, the components of which you can access (e.g. grouping by the month)?

"the optimiser is likely to come up with a better solution."

Please elaborate.

1 comments

Ok here we go:

Table is a list of rows. To access a row is O(N) assuming perfect B-Tree. To access an element in an array is O(M) therefore to access an element of an array in a row is O(N*M) as you have to materialize the entire page that the array is in from disk. That array could be larger than a page which leads to more problems.

GIN is a possibility but the array elements are ordered which is overhead (they are a list, not a relational set) therefore there is a condition there if you want to reference by index which is the primary point of arrays. If not, you should use a set if they aren't ordered which means you should probably use a table and join anyway.

Fundamentally membership is a set relation, not a list relation. The order of the tags is not important.

Pain = PgSQL documentation 8.14.3 which references indices of arrays directly in a where clause.

A string - I wouldn't do string manipulation anyway in a where clause.

A date - that's a number which you can apply relops to easily.

The optimiser will most likely be better at retrieving and caching data based on a join condition versus column contents.

"To access an element in an array"

I guess that's the thing: why would you want to access array elements? What business problem is that solving?

"there is a condition there if you want to reference by index which is the primary point of arrays"

In this example, he's using an array more like a set. I don't think he intends for a tag at index 17 to mean something different than a tag of the same value at index 3.

I agree that postgresql should probably have a good "set" datatype, but at this time it does not and arrays are a reasonable way to implement sets in a lot of cases.

For instance, GIN allows fast searches on queries like "find the arrays that contain element 'X'". That's more like a set operation.

"The optimiser will most likely be better at retrieving and caching data based on a join condition versus column contents."

There are probably some cases where it's better and some where it's worse. Postgres does keep track of stats for array elements in addition to the arrays themselves, so it's not obvious to me that using a separate table would improve matters.