| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by geocar 2781 days ago

> 'metric_name text' is actually a tag-value list. Many TSDB's allows you to match data by tag. Each tag should be represented by a column in your example.

For the life of me, I can't figure out why this would be a good idea. I feel like I must not understand what you're saying:

If I've got a million disks that I want to draw usage graphs for, why I would put each one in a separate column?

What's the business use-case you're imagining?

> Usually, you need to read many series at once so your query will turn into full table scan.

Why do I need a full table scan if I'm going to draw some graphs?

I've got something like 4000 pixels across my screen; I could supersample by 100x and still be pulling down less data than the average nodejs/webpack app.

> Imagine that you have 1M series and each series gets new data point every second. In your scema it will result in 1M random writes.

No that's definitely not what manigandham is suggesting. One million disks each reporting their usage means a million rows in two columns (disk name/sym, and volume) would be written (relatively) linearly.

1 comments

chaotic-good 2780 days ago

Modern TSDB is expected to support tags. This means that every series will have a unique set of tag-value pairs associated with it. E.g. 'host=Foo OS=CentOS arch=amd64 ssdmodel=Intel545 ...'. And in the query you could pinpoint relevant series by this tags thus the tags should be searchable. For instance, I may want to see how specific SSD model performs on specific OS or specific app. If the set of tags is stored as a json in one field such queries wouldn't work efficiently.

About that 1M writes thing. You have two options. 1) Organize data by metric name first, or 2) by timestamp. In case of 2) the updates will be linear but reads will have huge amplification. In case of 1) updates will be random, but reads will be fast.