| HN Mirror

by jnordwick 3069 days ago

I wouldn't call this a time series database at all. To me, tsdb implies analytics over a time dimension such as weather sensors or stock market data.

This is just partitioning a log on time so you can query the most recent data and delete the old stuff.

It doesn't even really seem to me that you necessarily want to partition on time, since your load distribution is going to be terrible.

Edit: to add a little. There is a thing called a temporal database that is a little more general in usage, I feel, in that it is more about facts at specific points in time (such as your address last year), and I think that is closer to what this is about.

There is even a bitemporal database that has two time dimensions (what do we think your last year's address is right now and what did we think your last year's address was yesterday - and in those you don't ever delete data that is wrong, you just update your belief about that point in time) and they are really interesting to work with. Those would seem much more similar to this.
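
To make the two time dimensions concrete, here is a minimal in-memory sketch in Python. The names (`AddressFact`, `BitemporalStore`, `valid_from`, `recorded_on`, and so on) are made up for illustration and do not come from any particular database. The assumption is that corrections are appended as new rows rather than overwriting old ones, so both "what we believe now" and "what we believed yesterday" stay answerable.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class AddressFact:
    person: str
    address: str
    valid_from: date   # when the address became true in the real world
    valid_to: date     # exclusive end of real-world validity
    recorded_on: date  # when we started believing this fact


class BitemporalStore:
    """Append-only store: wrong facts are never deleted, only superseded."""

    def __init__(self) -> None:
        self._facts: List[AddressFact] = []

    def record(self, fact: AddressFact) -> None:
        self._facts.append(fact)

    def address_as_of(self, person: str, valid_on: date, known_on: date) -> Optional[str]:
        """What did we believe on `known_on` about the address on `valid_on`?"""
        candidates = [
            f for f in self._facts
            if f.person == person
            and f.valid_from <= valid_on < f.valid_to
            and f.recorded_on <= known_on
        ]
        if not candidates:
            return None
        # The most recently recorded belief wins.
        return max(candidates, key=lambda f: f.recorded_on).address
```

Querying the same `valid_on` date with two different `known_on` dates is what distinguishes this from a plain temporal table: the answer can change without any row being deleted.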

2 comments

smilliken 3069 days ago

Thanks for mentioning bitemporal databases, I hadn't heard of anything like that before. Time-based mutable facts are so hard to represent well.

I think their definition of time-series database fits the common usage I've seen everywhere: the data has a time dimension and is append-only/immutable (well, OK, you can mutate the data in a PostgreSQL table, but nobody's forcing you to).

Given the choice between selecting a specialized time-series-only database or using a time-series pattern in your existing PostgreSQL database, PostgreSQL is often (usually?) the more pragmatic choice. That's what we do at MixRank, with time-series tables approaching 100 billion rows.

sahil-kang 3069 days ago

I also feel that using the bitemporal pattern on a Postgres DB is the most pragmatic choice. What are some advantages to using a specialized timeseries DB? I can’t really think of any.

smilliken 3069 days ago

For one, you can avoid double-writing to disk by only having the log instead of the WAL/log + table. You can save space by using a more compact binary representation. Basically all performance/efficiency related.
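
As a rough illustration of the "more compact binary representation" point, here is a small Python sketch that packs one reading into a fixed-width record and compares it with a JSON text encoding of the same values. The field layout (`ts_ms`, `sensor_id`, `value`) is a made-up example, and the JSON comparison is only there to show the size gap; it is not a claim about how PostgreSQL or any specific time-series database stores rows.

```python
import json
import struct

# Hypothetical fixed layout, little-endian with no padding:
# 8-byte signed timestamp in ms, 4-byte unsigned sensor id, 8-byte float value.
RECORD = struct.Struct("<qId")


def encode_binary(ts_ms: int, sensor_id: int, value: float) -> bytes:
    """Pack one reading into a fixed 20-byte record."""
    return RECORD.pack(ts_ms, sensor_id, value)


def decode_binary(buf: bytes) -> tuple:
    """Unpack a 20-byte record back into (ts_ms, sensor_id, value)."""
    return RECORD.unpack(buf)


def encode_text(ts_ms: int, sensor_id: int, value: float) -> bytes:
    """Encode the same reading as JSON text, for size comparison only."""
    return json.dumps({"ts_ms": ts_ms, "sensor_id": sensor_id, "value": value}).encode("utf-8")
```

Real systems usually go further (delta-encoding timestamps, compressing columns), but the basic saving already shows up at the level of a single record.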

nileshtrivedi 3069 days ago

How about a Merkle tree structure for storing data (like Git does)? This would make it easy to find out what the snapshot at any given point in time was. The question is whether it is powerful enough to support typical data-oriented applications.
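
A minimal sketch of the idea in Python, loosely modeled on Git's blob/tree/commit objects. Everything here (`SnapshotStore`, `commit`, `checkout`, the flat one-level tree, JSON as the serialization) is an assumption made for illustration; Git's real object format differs, and this says nothing about whether the approach scales to a real workload.

```python
import hashlib
import json
from typing import Dict, Optional


class SnapshotStore:
    """Content-addressed store: every object is keyed by the hash of its bytes."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def _put(self, payload: bytes) -> str:
        digest = hashlib.sha256(payload).hexdigest()
        self.objects[digest] = payload  # identical content is stored once
        return digest

    def commit(self, records: dict, parent: Optional[str]) -> str:
        """Store a full snapshot (string keys, JSON values) and return its commit hash."""
        # Each record becomes a blob; the tree maps key -> blob hash.
        tree = {
            key: self._put(json.dumps(value, sort_keys=True).encode("utf-8"))
            for key, value in records.items()
        }
        tree_hash = self._put(json.dumps(tree, sort_keys=True).encode("utf-8"))
        commit = {"tree": tree_hash, "parent": parent}
        return self._put(json.dumps(commit, sort_keys=True).encode("utf-8"))

    def checkout(self, commit_hash: str) -> dict:
        """Rebuild the snapshot that a commit hash points at."""
        commit = json.loads(self.objects[commit_hash])
        tree = json.loads(self.objects[commit["tree"]])
        return {key: json.loads(self.objects[blob]) for key, blob in tree.items()}
```

Any past state can be recovered from its commit hash, and unchanged records are shared between snapshots because they hash to the same blob. What this sketch does not give you is querying across snapshots, which is where the "powerful enough?" question starts to bite.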

sahil-kang 3069 days ago

Thanks, I didn’t consider the redundancy of the WAL. Maybe I’ll spend some time digging into the newer DB implementations/extensions.

onderkalaci 3069 days ago

> It doesn't even really seem to me that you necessarily want to partition on time, since your load distribution is going to be terrible.

I think that's not accurate. The tables mentioned in the post are first sharded/distributed on `repo_id`. Then each shard is also partitioned on the time dimension (i.e., `created_at`). Thus, the load should be distributed in proportion to the activity for each `repo_id`.
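
The two-level layout can be sketched in a few lines of Python. This is only an illustration of the routing idea: `SHARD_COUNT`, the hash-based `shard_for`, and the daily partition naming are all assumptions, not the actual mechanism used by the system in the post.

```python
import hashlib
from datetime import datetime
from typing import Tuple

SHARD_COUNT = 4  # hypothetical number of shards


def shard_for(repo_id: int) -> int:
    """First level: pick a shard from repo_id, independent of time."""
    digest = hashlib.sha256(str(repo_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % SHARD_COUNT


def partition_for(created_at: datetime) -> str:
    """Second level: pick a time partition inside the shard (daily, by assumption)."""
    return created_at.strftime("%Y_%m_%d")


def route(repo_id: int, created_at: datetime) -> Tuple[int, str]:
    """Return (shard, partition) for one event."""
    return shard_for(repo_id), partition_for(created_at)
```

Events written at the same moment for different repositories land on different shards, so "now" is not a single hot spot. The time partition only decides which table within a shard receives the row, which is what keeps dropping old data cheap.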
