| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by toast0 4927 days ago
	possibly dumb question; what's the difference between columnar storage and a bunch of row-major tables with an id column, and one column of data?

3 comments

grundprinzip 4927 days ago

The biggest advantage of a pure column-oriented DBMS comes from having the positional information indirectly available without the requirement to store this ID information. During query execution the required position lists can then be generated.

In addition. If you considere the record format of traditional row-oriented databases you will see that the overhead of storing a single attribute record is rather high. Since with column-oriented DBMS its all about IO performance (Disk/Memory, Memory/CPU) such overhead can diminish the advantage.

Thus typical column stores tend to use only single strings of sequential memory to store the data. This can even be enhanced by applying dictionary compression and as a result only storing integer values. And modern CPUs are good in processing lots of them.

link

chouxiaowen 4927 days ago

The id column and row header (as depicted in the post) can significantly waste I/O bandwidth.

See the paper "Column-stores vs. row-stores: how different are they really?" in SIGMOD 2008 for performance comparisons between C-store and approaches of emulating column-store in row-store databases.

link

jacques_chester 4927 days ago

grundprinzip's answer is good. I'd like to add a remark.

In database systems it is important to distinguish between the logical and physical models.

When you design a relational database, you focus on the correct logical model.

"People can have multiple phone numbers".

"Phone numbers belong to a single phone".

"Mobile phones are possessed by one person. Landlines can be shared".

And so on. You express this logical model to the database, most likely in SQL.

Eventually you notice that query X is slow. Your first step is to check that your logical design was sound, because poorly designed schemata are hard for query planners to reason correctly about.

Then you start doing things to the physical representation. You say stuff like:

"I look up by phone numbers a lot."

Or, in SQL terms, you add an index to a column.

Similarly, as this article pointed out, there are times when grouping data by column rather than row is advantageous. So then you tell the database to use a columnar store.

And so on. Modern RDBMSes all support the same major logical model descriptions; but they can vary widely on what physical directives you give.

link