| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by jeffdavis 5091 days ago

"The more you put in the database black box, the more scalability problems you will encounter in the future."

I don't see a obvious relationship between using arrays to store product tags and scalability.

What method of solving this problem are you suggesting, and what are the scalability characteristics of that solution?

Databases don't go out of their way to hurt scalability. Databases are shared state, and scaling that is just hard. Unless the problem is easy, in which case it's easy no matter what you use.

1 comments

eckyptang 5091 days ago

Looking at the way arrays are implemented, they can readily turn operations into O(M*N) rather than O(N). Even the PgSQL documentation suggests cases that will hurt you.

I would suggest that the relational model is maintained as there are no cases in which it isn't valid where arrays are and the optimiser is likely to come up with a better solution.

Relational databases (well all databases) have convenient tools which in the short term look good, but in the long term will hurt you.

Careful testing and feature selection is the alternative I am suggesting (from 20 years of experience with RDBMS platforms).

link

jeffdavis 5091 days ago

"can readily turn operations"

Which operations, specifically? We have the product tags example to work from, so it would help to be more concrete.

"into O(M*N) rather than O(N)"

I assume here that your target is "scalability to higher numbers of tags for a given product"? If so, please state that explicitly. Again, M and N aren't necessary, because we have an example to work from.

Are the operations still susceptible to worse scalability on this axis if you employ a GIN index?

"PgSQL documentation suggests cases that will hurt you."

Please elaborate.

"I would suggest that the relational model is maintained"

I don't see how your solution is relational and the other one is not. Can you please explain? How is using an array different from using, say, a string, which is essentially an array of characters? How is it different from using a date, the components of which you can access (e.g. grouping by the month)?

"the optimiser is likely to come up with a better solution."

Please elaborate.

link

eckyptang 5091 days ago

Ok here we go:

Table is a list of rows. To access a row is O(N) assuming perfect B-Tree. To access an element in an array is O(M) therefore to access an element of an array in a row is O(N*M) as you have to materialize the entire page that the array is in from disk. That array could be larger than a page which leads to more problems.

GIN is a possibility but the array elements are ordered which is overhead (they are a list, not a relational set) therefore there is a condition there if you want to reference by index which is the primary point of arrays. If not, you should use a set if they aren't ordered which means you should probably use a table and join anyway.

Fundamentally membership is a set relation, not a list relation. The order of the tags is not important.

Pain = PgSQL documentation 8.14.3 which references indices of arrays directly in a where clause.

A string - I wouldn't do string manipulation anyway in a where clause.

A date - that's a number which you can apply relops to easily.

The optimiser will most likely be better at retrieving and caching data based on a join condition versus column contents.

link

jeffdavis 5091 days ago

"To access an element in an array"

I guess that's the thing: why would you want to access array elements? What business problem is that solving?

"there is a condition there if you want to reference by index which is the primary point of arrays"

In this example, he's using an array more like a set. I don't think he intends for a tag at index 17 to mean something different than a tag of the same value at index 3.

I agree that postgresql should probably have a good "set" datatype, but at this time it does not and arrays are a reasonable way to implement sets in a lot of cases.

For instance, GIN allows fast searches on queries like "find the arrays that contain element 'X'". That's more like a set operation.

"The optimiser will most likely be better at retrieving and caching data based on a join condition versus column contents."

There are probably some cases where it's better and some where it's worse. Postgres does keep track of stats for array elements in addition to the arrays themselves, so it's not obvious to me that using a separate table would improve matters.

link