|
|
|
|
|
by jiggawatts
795 days ago
|
|
This copied the superficial data layout without the key benefit of modern columnar formats: segment elimination. Most such formats support efficient querying by skipping the disk read step entirely when a chunk of data is not relevant to a query. This is done by splitting the data into segments of about 100K rows, and then calculating the min/max range for each column. That is stored separately in a header or small metadata file. This allows huge chunks of the data to be entirely skipped if it falls out of range of some query predicate. PS: the same compression ratio advantages could be achieved by compressing columns stored as JSON arrays, but such a format could encode all Unicode characters and has a readily available decoder in all mainstream programming languages. |
|
> Price⇥⇥0 {rows:2, distinct:2, minvalue:111.11, maxvalue:222.22} 111.11⮐222.22⮐
> Price⇥⇥1 {rows:1, distinct:1, minvalue:333.33, maxvalue:333.33} 333.33⮐