

by manigandham 2777 days ago

You are talking about regular relational databases. I'm talking about distributed column-oriented databases. Big difference.

You can store tags and other data in JSON/ARRAY columns. The primary key is used automatically for sharding and sorting.

Groups of rows are sorted, split into columns, compressed, and stored as partitions with metadata. This means you can 'scan' the entire table in milliseconds using only the metadata, then open just the partitions, and the columns inside them, that your query actually needs. There are no random writes either: it's all constant sequential I/O, with optional background optimization. And because of compression, storing the same key millions of times has no real overhead.
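To make the mechanism above concrete, here is a rough Python sketch. It is a toy model and not any particular database's implementation: the names (`Partition`, `build_partitions`, `query`, `run_length_encode`) are made up for illustration, plain lists stand in for compressed column blocks, the metadata is assumed to be a min/max of the sort key per partition, and run-length encoding stands in for whatever compression a real engine uses.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class Partition:
    # Metadata kept per partition: the range of the sort key it covers.
    min_key: int
    max_key: int
    # Column name -> values. A real engine would store each column as a
    # separately compressed block; plain lists are a stand-in here.
    columns: Dict[str, List[Any]]


def build_partitions(rows: List[dict], key: str, rows_per_partition: int) -> List[Partition]:
    """Sort rows by the key, cut them into groups, and pivot each group into columns."""
    ordered = sorted(rows, key=lambda r: r[key])
    partitions = []
    for start in range(0, len(ordered), rows_per_partition):
        chunk = ordered[start:start + rows_per_partition]
        columns = {name: [r[name] for r in chunk] for name in chunk[0]}
        partitions.append(Partition(chunk[0][key], chunk[-1][key], columns))
    return partitions


def query(partitions: List[Partition], key: str, lo: int, hi: int, wanted: str) -> Tuple[list, int]:
    """Return values of `wanted` where lo <= key <= hi, plus how many partitions were opened."""
    opened = 0
    out = []
    for p in partitions:
        # Metadata-only check: skip partitions whose key range cannot match.
        if p.max_key < lo or p.min_key > hi:
            continue
        opened += 1
        # Only the key column and the requested column are read.
        keys = p.columns[key]
        vals = p.columns[wanted]
        out.extend(v for k, v in zip(keys, vals) if lo <= k <= hi)
    return out, opened


def run_length_encode(values: List[Any]) -> List[Tuple[Any, int]]:
    """Collapse consecutive repeats into (value, count) pairs."""
    runs: List[Tuple[Any, int]] = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs


if __name__ == "__main__":
    rows = [{"ts": i, "host": "web-1", "value": i * 2} for i in range(1000)]
    parts = build_partitions(rows, key="ts", rows_per_partition=100)
    values, opened = query(parts, key="ts", lo=250, hi=260, wanted="value")
    print(len(parts), opened, len(values))
    print(run_length_encode(parts[0].columns["host"]))
```

In this toy setup the 1,000 rows land in 10 partitions, the range query opens only the one partition whose min/max range overlaps it, and the repeated `host` value in a partition collapses to a single (value, count) pair, which is the sense in which a key stored millions of times costs almost nothing.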

As stated several times before, we deal with this every day on trillion-row tables, inserting hundreds of billions of rows daily. Queries run in seconds. We do just fine.