Hacker News new | ask | show | jobs
by shrubble 2117 days ago
If I am understanding you properly, couldn't you do this with SQL by specifying the range of data that represents the timeseries you care about, selected via materialized view?

If you used a stored procedure to compute the range that becomes the view, then all you need to store are the parameters to feed to the stored procedure again, which data you could itself store in a separate table.

1 comments

As I said in a sibling comment — every version needs to be “hot” / “online” at the same time. The point of this system is to allow for random access to OLAP queries for arbitrary historical versions of the system; and, in fact, to even do time-series reports that perform a given analysis against every available version of the data, hopefully with some degree of parallelism. In matview terms, that means that every version of the data needs to be concurrently materialized.

Given just 100M keys (let’s call it a 20GB exported snapshot size), and 1M versions, that’s an overwhelming amount of data — and 99.9999% of it is redundant copies of the same information, i.e. the stuff that didn’t change between versions.

Solving the problem of the concurrent materializations requiring petabytes of storage for almost-entirely-redundant heap tuples, is essentially solving the problem of creating a tuple-deduplicating DBMS storage engine — which is equivalent to the problem of building a versioned embedded database :)

So use views instead of materialized view. How much time/effort could be saved by simply making better queries? Are you sure that you are looking for or controlling for the right problems?
The goal is to host an infrastructure to accelerate arbitrary user queries, ala a Business Intelligence data-warehouse backend. We don't get to specify what queries the users are doing. It's the classical problem that SQL RDBMSes were introduced to solve: having the data, and having to shape it in advance of knowledge of reporting workload.