| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by manigandham 2777 days ago

You don't need to add new columns.

   CREATE TABLE metrics (metric_name text, ts timestamp, properties json, key(metric_name, ts))

OLAP queries with SQL are very good at handling whatever dimensions you want.

2 comments

chaotic-good 2777 days ago

'metric_name text' is actually a tag-value list. Many TSDB's allows you to match data by tag. Each tag should be represented by a column in your example.

Single table design will be prone to high read/write amplification due to data alignment. Usually, you need to read many series at once so your query will turn into full table scan. Or it will read a lot of unneeded data which happened to be located near the data you need. Writes will be slow since your key starts with metrics name. Imagine that you have 1M series and each series gets new data point every second. In your scema it will result in 1M random writes.

Cardinality of the table will go through the roof, BTW. Every data point will add the key. Good luck dealing with this.

link

geocar 2776 days ago

> 'metric_name text' is actually a tag-value list. Many TSDB's allows you to match data by tag. Each tag should be represented by a column in your example.

For the life of me, I can't figure out why this would be a good idea. I feel like I must not understand what you're saying:

If I've got a million disks that I want to draw usage graphs for, why I would put each one in a separate column?

What's the business use-case you're imagining?

> Usually, you need to read many series at once so your query will turn into full table scan.

Why do I need a full table scan if I'm going to draw some graphs?

I've got something like 4000 pixels across my screen; I could supersample by 100x and still be pulling down less data than the average nodejs/webpack app.

> Imagine that you have 1M series and each series gets new data point every second. In your scema it will result in 1M random writes.

No that's definitely not what manigandham is suggesting. One million disks each reporting their usage means a million rows in two columns (disk name/sym, and volume) would be written (relatively) linearly.

link

chaotic-good 2776 days ago

Modern TSDB is expected to support tags. This means that every series will have a unique set of tag-value pairs associated with it. E.g. 'host=Foo OS=CentOS arch=amd64 ssdmodel=Intel545 ...'. And in the query you could pinpoint relevant series by this tags thus the tags should be searchable. For instance, I may want to see how specific SSD model performs on specific OS or specific app. If the set of tags is stored as a json in one field such queries wouldn't work efficiently.

About that 1M writes thing. You have two options. 1) Organize data by metric name first, or 2) by timestamp. In case of 2) the updates will be linear but reads will have huge amplification. In case of 1) updates will be random, but reads will be fast.

link

manigandham 2777 days ago

You are talking about regular relational databases. I'm talking about distributed column-oriented databases. Big difference.

You can store tags and other data in JSON/ARRAY columns. The primary key is used for automatically sharding and sorting.

Groups of rows are sorted, split into columns, compressed, and stored as partitions with metadata. This means you can 'scan' the entire table in milliseconds using metadata and then only open the partitions, and the columns inside, that you actually need for your query. There are no random writes either, it's all constant sequential I/O with optional background optimization. And because of compression, storing the same key millions of times has no real overhead.

As stated several times before, we deal with this everyday on trillion row tables inserting 100s of billions of rows daily. Queries run in seconds. We do just fine.

link

brylie 2777 days ago

We have a lot of data stored in Postgres JSON fields at my work. Around three months ago, we were trying to optimize some queries by adding sub-key indexing to the JSON field. We tried multiple times, but Postgres seemed to keep using sequential scan on the records, rather than the JSON index. So, we just decided to normalize the data and use proper foreign key fields for query performance.

link