Hacker News new | ask | show | jobs
by derefr 2117 days ago
As I said in a sibling comment — every version needs to be “hot” / “online” at the same time. The point of this system is to allow for random access to OLAP queries for arbitrary historical versions of the system; and, in fact, to even do time-series reports that perform a given analysis against every available version of the data, hopefully with some degree of parallelism. In matview terms, that means that every version of the data needs to be concurrently materialized.

Given just 100M keys (let’s call it a 20GB exported snapshot size), and 1M versions, that’s an overwhelming amount of data — and 99.9999% of it is redundant copies of the same information, i.e. the stuff that didn’t change between versions.

Solving the problem of the concurrent materializations requiring petabytes of storage for almost-entirely-redundant heap tuples, is essentially solving the problem of creating a tuple-deduplicating DBMS storage engine — which is equivalent to the problem of building a versioned embedded database :)

1 comments

So use views instead of materialized view. How much time/effort could be saved by simply making better queries? Are you sure that you are looking for or controlling for the right problems?
The goal is to host an infrastructure to accelerate arbitrary user queries, ala a Business Intelligence data-warehouse backend. We don't get to specify what queries the users are doing. It's the classical problem that SQL RDBMSes were introduced to solve: having the data, and having to shape it in advance of knowledge of reporting workload.