| HN Mirror


by setr 265 days ago

A row-based index is a column-wise copy of the data, with mechanisms to skip forward during scanning. You maintain a separate copy of the column to support this, making indexes expensive, and thus the DBA is asked to maintain a minimal subset.

A columnar database’s index is simply laid out on top of the column data. If the column is the key, then it’s sorted by definition, and no index is really required outside of maybe a zone map, because you can binary search. A non-key column gets a zone map / skip index laid out on top, which is cheap to maintain… because it’s already a column-wise slice of the data.
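To make the "you can binary search" point concrete, here is a minimal sketch. It assumes the key column is held as a plain sorted Python list; the `key_range` helper and the sample values are hypothetical and not taken from any particular database.

```python
from bisect import bisect_left, bisect_right

def key_range(sorted_keys, lo, hi):
    """Return the half-open row range [start, stop) where lo <= key <= hi.

    Works only because the key column is already stored in sorted order,
    so no separate index structure is consulted.
    """
    start = bisect_left(sorted_keys, lo)   # first position with key >= lo
    stop = bisect_right(sorted_keys, hi)   # first position with key > hi
    return start, stop

# Hypothetical key column, sorted because it is the table's sort key.
keys = [3, 7, 7, 12, 18, 25, 31]
start, stop = key_range(keys, 7, 18)
matching = keys[start:stop]
```

Both lookups are O(log n) on the column itself, which is why a sorted key column needs little beyond (maybe) a zone map.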

You don’t often add indexes to an OLAP system because every column is indexed by default — because it’s cheap to maintain, because you don’t need a separate column-wise copy of the data because it’s already a column-wise copy of the data.

1 comments

SkiFire13 265 days ago

> A non-key column gets a zone map / skip index laid out on top, which is cheap to maintain… because it’s already a column-wise slice of the data.

I don't see how that's different from storing a traditional index. You can't just lay it on top of the column, because the column is stored in a different order than what the index wants.


setr 264 days ago

Zonemap / skip indexes don’t require sorting, still provide significantly improved searching over full table scans, and are typically applied to every column by default. Sorting is even better, just at the cost of a second copy of the dataset.
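A rough sketch of what a zone map over an unsorted column might look like, assuming fixed-size blocks and a per-block (min, max) pair. The function names, block size, and data are made up for illustration; real engines differ in the details.

```python
def build_zone_map(column, block_size):
    """Build one (min, max) pair per fixed-size block of an unsorted column."""
    return [
        (min(column[i:i + block_size]), max(column[i:i + block_size]))
        for i in range(0, len(column), block_size)
    ]

def scan_equal(column, zone_map, block_size, target):
    """Return (matching row ids, number of blocks actually scanned)."""
    rows, scanned = [], 0
    for b, (lo, hi) in enumerate(zone_map):
        if target < lo or target > hi:
            continue  # target cannot be in this block, so skip it unread
        scanned += 1
        base = b * block_size
        for offset, value in enumerate(column[base:base + block_size]):
            if value == target:
                rows.append(base + offset)
    return rows, scanned

# Hypothetical unsorted column, split into three blocks of four values.
column = [5, 9, 7, 6, 42, 40, 47, 44, 8, 41, 6, 45]
zone_map = build_zone_map(column, 4)
rows, scanned = scan_equal(column, zone_map, 4, 44)
```

Here the first block is skipped outright, while the last block still has to be read because its min/max range is wide. That is the trade-off: no sorting and no second copy of the data, but the pruning is only as good as the value clustering within each block.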

In a row-based rdbms, any indexing whatsoever is a copy of the column-data, so you might as well store it sorted every time. It’s not inherent to the definition.


SkiFire13 262 days ago

> Zonemap / skip indexes don’t require sorting

That's still a separate index though, no? It's not intrinsic to the column storage itself, although I guess it works best with it if you end up having to do a full scan of the column section anyway.

> Sorting is even better, just at the cost of a second copy of the dataset.
> ...
> In a row-based rdbms, any indexing whatsoever is a copy of the column-data

So the same thing, no?


setr 262 days ago

I’m not saying columnar databases don’t have indexes; I’m saying that they get to have indexes for cheap (and importantly, without maintaining a separate copy of the data being indexed), to the point that every column is indexed by definition. It’s a separate data structure, but it’s not a separate db object exposed to the user; it’s just part of the definition.

> So the same thing, no?

Think of it like this: for a given filtered query, row-based storage does a full table scan if no index exists. There is no middle ground. Say 0% value or 100%.

A columnar database’s baseline is a decent index, and if there’s a sorted index then even better. Say 60% value vs 100%.

The relative importance of having a separate, explicit, sorted index is much lower in a columnar database, because the baseline is different. (Although maintaining extra sorted indexes in a columnar database is much more expensive: you basically have to keep a second copy of the entire table sorted on the new key(s).)
