| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by derefr 2028 days ago

Let's start with views. A database view is a "stored query" that presents itself as a table, that you can further query against.

If you have a view "bar":

    CREATE VIEW bar AS $$
    SELECT x * 2 AS a, y + 1 AS b FROM foo
    $$

and then you `SELECT a FROM bar`, then the "question" you're really asking is just:

    SELECT a FROM (SELECT x * 2 AS a, y + 1 AS b FROM foo)

— which, with efficient query planning, boils down to

    SELECT x * 2 AS a FROM foo

It's especially important to note that the `y + 1` expression from the view definition isn't computed in this query. The inner query from the view isn't "compiled" — forced to be in some shape — but rather sits there in symbolic form, "pasted" into your query, where the query planner can then manipulate and optimize/streamline it further, to suit the needs of the outer query.

-----

To materialize something is to turn it from symbolic-expression form, into "hard" data — a result-set of in-memory row-tuples. Materialization is the "enumeration" in a Streams abstraction, or the "thunk" in a lazy-evaluation language. It's the master screw that forces all the activity dependent on it — that would otherwise stay abstract — to "really happen."

Databases don't materialize anything unless they're forced to. If you do a query like

    SELECT false FROM (SELECT * FROM foo WHERE x = 1)

...no work (especially no IO) actually happens, because no data from the inner query needs to be materialized to resolve the outer query.

Streaming data out of the DB to the user requires serialization [= putting the data in a certain wire format], and serialization requires materialization [= having the data available in memory in order to read and re-format it.] So whatever final shape the data returned from your outermost query has when it "leaves" the DB, that data will always get materialized. But other processes internal to the DB may sometimes require data to be materialized as well.

Materialization is costly — it's usually the only thing forcing the DB to actually read the data on disk, for any columns it wasn't filtering by. Many of the optimizations in RDBMSes — like the elimination of that `y + 1` above — have the goal of avoiding materialization, and the disk-reads / memory allocations / etc. that materialization requires.

-----

Those definitions out of the way, a "materialized view" is something that acts similar to a view (i.e. is constructed in terms of a stored query, and presents itself as a queriable table) but which — unlike a regular view — has been pre-materialized. The query for a matview is still stored, but at some point in advance of querying, the RDBMS actually runs that query, fully materializes the result-set from it, and then caches it.

So, basically, a materialized view is a view with a cached result-set.

Like any cache, this result-set cache increases read-time efficiency in the case where the original computation was costly. (There's no point in "upgrading" a view into a matview if your queries against the plain view were already cheap enough for your needs.)

But like any cache, it needs to be maintained, and can become out-of-sync with its source.

Although materialized views are part of the SQL standard, not all SQL RDBMSes implement them. MySQL/MariaDB does not, for example. (Which is why you'll find that much of the software world just pretends matviews don't exist when designing their DB architectures. If it ever needs to run on MySQL, it can't use matviews.)

The naive approach that some other RDBMSes (e.g. Postgres) take to materialized views, is to only offer manual, full-pass recalculation of the cached result-set, via some explicit command (`REFRESH MATERIALIZED VIEW foo`). This works with "small data"; but at scale, this approach can be so time-consuming for large and complex backing queries, that by the time cache is rebuilt, it's already out-of-date again!

Because there are RDBMSes that either don't have matviews, or don't have scalable matviews, many application developers just avoid the RDBMS's built in matview abstraction, and build their own. Thus, another large swathe of the world's database architecture either will use cron-jobs to regular run+materialize a query, and then dump its results back into a table in the same DB; or it will define on-INSERT/UPDATE/DELETE triggers on "primary" tables, that transform and upsert data into "secondary" denormalized tables. These are both approaches to "simulating" matviews, portably, on an RDBMS substrate that isn't guaranteed to have them.

Other RDBMSes (e.g. Oracle, SQL Server, etc.) do have scalable materialized views, a.k.a. "incrementally materialized" views. These work less like a view with a cache, and more like a secondary table with write-triggers on primary tables to populate it — but all handled under-the-covers by the RDBMS itself. You just define the matview, and the RDBMS sees the data-dependencies and sets up the write-through data flow.

Incrementally-materialized views are great for what they're designed for (reporting, mostly); but they aren't intended to be the bedrock for an entire architecture. Building matviews on top of matviews on top of matviews gets expensive fast, because even fancy enterprise RDBMSes like Oracle don't realize, when populating table X, that writing to X will in turn write to matview Y, which will in turn "fan out" to matviews {A,B,C,D}, etc. These RDBMS's matviews were never intended to support complex "dataflow graphs" of updates like this, and so there's too much overhead (e.g. read-write contention on index locks) to actually make these setups practical. And it's very hard for these DBMSes to change this, as their matviews' caches are fundamentally reliant on database table storage engines, which just aren't the right ADT to hold data with this sort of lifecycle.

-----

Materialize is an "RDBMS" (though it's not, really) engineered from the ground up to make these sorts of dataflow graphs of matviews-on-matviews-on-matviews practical, by doing its caching completely differently.

Materialize looks like a SQL RDBMS from the outside, but Materialize is not a database — not really. (Materialize has no tables. You can't "put" data in it!) Instead, Materialize is a data streaming platform, that caches any intermediate materialized data it's forced to construct during the streaming process, so that other consumers can work off those same intermediate representations, without recomputing the data.

If you've ever worked with Akka's Streams, or Elixir's Flows, or for that matter with Apache Beam (nee Google Dataflow), Materalize is that same kind of pipeline. But where all the plumbing work of creating intermediate representations — normally a procedural map/reduce/partition kind of thing — is done by defining SQL matviews; and where the final output isn't a fixed output of the pipeline, but rather comes from running an arbitrary SQL query against any arbitrary matview defined in the system.

3 comments

dragonwriter 2028 days ago

> Most RDBMSes (e.g. Postgres) only offer manual (`REFRESH MATERIALIZED VIEW foo`) full-pass recalculation of the cached result-set for matviews.

"Most" here seems very much wrong, at least of major products: Oracle has an option for on-commit (rather than manual) and incremental/incremental-if-possible (FAST/FORCED) refresh, so it is limited to neither only-manual nor only-full-pass recalculation. SQL Server indexed views (their matview solution) are automatically incrementally updated as base tables change, they don't even have an option for manual full-pass recalculation, AFAICT. DB2 materialized query tables (their matview solution) have an option for immediate (on-commit) refresh (not sure if the algo here is always full-pass, but its at a minimum not always manual.) Firebird and MySQL/MariaDB don't have any support for materialized views at all (though of course you can manually simulate them with additional tables updated by triggers.) Postgres seems to be the only major RDBMS with both material view support and the limitation of only on-demand full-pass recalculation of matviews (for that matter, except maybe DB2 having the full-pass limitation, it seems to be the only one with either the only-manual or only-full-pass limitation.)

link

jacques_chester 2028 days ago

I think that it's true that many databases offer incremental updates and it's incorrect to say that manual refreshes were the state of the art.

The important point is that Materialize can do it for almost any query, very efficiently, compared to existing options. That opens a lot of possibilities.

link

dragonwriter 2028 days ago

> The important point is that Materialize can do it for almost any query, very efficiently, compared to existing options. That opens a lot of possibilities.

Yes, this does seem like a very big deal.

link

derefr 2028 days ago

You're right; I updated my comment.

link

dragonwriter 2028 days ago

That was a fantastic and illuminating update, thank you.

link

quodlibetor 2028 days ago

This is a fantastic comment!

One small thing: we do now have tables[1]! At the moment they are ephemeral and only support inserts -- no update/delete. We will remove both of those limitations over time, though!

[1]: https://materialize.com/docs/sql/create-table/

link

jacques_chester 2028 days ago

This is an outstanding explanation. Much better than mine.

link