Hacker News new | ask | show | jobs
by didgetmaster 1478 days ago
Each column is a set of key-value pairs. The values are stored such that searching for specific values or patterns are extremely fast, much like regular indexes do. You can drop a column from the table (or add a new one) without needing to export all the data; modify the schema; and re-import the data to the modified table. A table is just a 'group' of these key-value stores I call table columns.

For example, an address column (type=STRING) may have a value (e.g. '123 Main St.') mapped to key 234 (e.g. the row key). In fact it might have 2 or more values mapped to that same key (assuming the address column is not a primary key column). So unlike regular relational tables, you can have multiple values per column for any row. This is very similar to how NoSQL systems store Json documents where any value can be an array.

A query that wants to find all rows that start with a '1' or contain 'Main' or end in 'St.' will get key 234 (along with any other keys that match). Once the keys are found, it can retrieve the values for any other specified columns that are mapped to those keys and give you the whole row.

In essence, this columnar store DB either has no indexes or is nothing but indexes depending on how you look at it. The point is there is no duplication of data between a regular store and a separate index.

1 comments

So you have a primary key ("row key") and each column C is essentially its own table containing column C and the primary key column, with the index being on column C. You can see how there is data duplication here, right? Each primary key value is duplicated #columns times.

So we have a mapping from column value to row index. Is there an inverse mapping stored if I want to find all column values contained in a given row?

Yes there is duplication of keys since each column has a copy of each key where a value in that column is mapped to that key.

The column (key-value store) maps both ways. It is really fast to find all the keys that are mapped to a given value (or pattern) as well as find all the values that are mapped to a key.

You could say the key is a primary key since each key is unique, but a table can also have a 'Primary Key' column where a unique value can only be mapped to a single key and every row must have a value in that column.

This system is very efficient for sparse tables. Null values are essentially free since the key-value pair is just missing if a column does not have a value for the row. It takes some processing to determine the key is not mapped anywhere in the column, but it is very fast.