Hacker News
by squigs25 4312 days ago
The implications of this extend beyond backing up your database.

Imagine a world where daily time-series data can be stored efficiently. This is a lesser-known use case, but it works like this: I'm a financial company and I want to store 1000 metrics about a potential customer: maybe the number of transactions in the past year, the number of defaults, the number of credit cards, etc.

Normally I would have to duplicate this row in the database every day/week/month/year for every potential customer. With some kind of git-like storing of diffs between the row today and the row yesterday, I could easily have access to time series information without duplicating unchanged information. This would accomplish MASSIVE storage savings.
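
To make the diff idea concrete, here is a minimal sketch in Python. It is not how any particular product works; `DeltaStore`, `record`, `as_of` and `stored_values` are hypothetical names, and it assumes rows arrive in date order and that metrics are never removed. Only metrics that changed since the previous day get stored, and a row for any day is rebuilt by replaying the diffs.

```python
from datetime import date


class DeltaStore:
    """Stores, per customer, only the metrics that changed on each day."""

    def __init__(self):
        # customer_id -> list of (day, {metric: new_value}), in date order
        self._deltas = {}

    def record(self, customer_id, day, row):
        """Store today's row as a diff against the latest known state."""
        history = self._deltas.setdefault(customer_id, [])
        current = self.as_of(customer_id, day)
        changed = {k: v for k, v in row.items() if current.get(k) != v}
        if changed:  # unchanged days cost nothing
            history.append((day, changed))

    def as_of(self, customer_id, day):
        """Rebuild the full row as it looked on the given day."""
        row = {}
        for d, changed in self._deltas.get(customer_id, []):
            if d > day:
                break
            row.update(changed)
        return row

    def stored_values(self, customer_id):
        """Count how many individual metric values are physically stored."""
        return sum(len(c) for _, c in self._deltas.get(customer_id, []))


store = DeltaStore()
store.record("c1", date(2024, 1, 1), {"transactions": 10, "defaults": 0, "cards": 2})
store.record("c1", date(2024, 1, 2), {"transactions": 10, "defaults": 0, "cards": 2})
store.record("c1", date(2024, 1, 3), {"transactions": 12, "defaults": 0, "cards": 2})
print(store.as_of("c1", date(2024, 1, 2)))
print(store.stored_values("c1"))
```

Three daily rows of three metrics would be nine values if duplicated; here only the first full row plus one changed metric are kept. The trade-off is that reads replay the history, so a real system would probably want periodic full snapshots.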

FWIW, efficiently storing time series data is a big problem at my company. No off-the-shelf solution makes this easy for us right now, and we would rather throw cheap hard disks at the problem than expensive engineers.

6 comments

There are a lot of existing compression algorithms for time series data that do just this. I'm not sure how well any of these are implemented, however. I think the problem is not necessarily how the data is stored, since that's fairly easy to fix with a bit of engineering effort if you're willing to write your own system. The harder part is rewriting query engines to take advantage of this sort of compression, although ideally this could just be abstracted away by the storage layer.
Kx Systems' kdb+ does this incredibly quickly and easily. I'm sure OneTick, Vhayu and others do too, though I don't have experience with them.

If you insist on standard SQL databases for time series, you'll have a lot more pain.

Have you looked at Datomic? It seems to fit your problem description well.
+1 for Datomic, seems to be right in the wheelhouse for this problem.
What about Cassandra? I believe it efficiently stores multiple timestamped values for each (row, column) cell as it changes. Google's BigTable design does this, and I believe you can use BigTable through App Engine.
Sounds like a case where event sourcing & CQRS might've been handy. (Not really something you can easily bolt on afterwards, though.)
Column-oriented databases virtually all feature this in the form of column compression (e.g. "repeat this value for the next 1000 rows"). And if you don't want column compression, they have sparse data filling/interpolation -- e.g. use the last available value from a time series. This is pretty much their bread and butter. Interpolation is essentially making the query engine smarter, so you don't end up in the situation you're apparently facing where you have to insert duplicate records purely to satisfy a simplistic join.
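
For anyone unfamiliar with the two techniques, here is a rough Python sketch. It is illustrative only and not taken from any specific column store; the function names `rle_encode`, `rle_decode` and `last_known` are made up for this example. The first pair is run-length encoding of a column, and the last function is "use the last available value" over a sparse series with sorted timestamps.

```python
from bisect import bisect_right


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out


def last_known(timestamps, values, t):
    """Return the most recent value at or before t, or None if there is none.

    Assumes timestamps is sorted ascending and aligned with values.
    """
    i = bisect_right(timestamps, t)
    return values[i - 1] if i else None


column = [2, 2, 2, 3, 3, 2]
print(rle_encode(column))                      # [(2, 3), (3, 2), (2, 1)]
print(last_known([1, 5, 9], ["a", "b", "c"], 6))  # b
```

With `last_known` doing the lookup, a join against a daily calendar no longer needs a physical row for every day, only one for each day something changed.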

Back to this product (which appears to simply wholesale copy databases?), I use LVM for exactly what it is doing -- I create and rollback and access and update LVM snapshots of databases. The snapshots are instant, and in most situations the data duplication is very limited. LVM is one of the coolest, most under-appreciated facets of most Linux installs -- http://goo.gl/J2mIvG