These look interesting. It's worth mentioning that incremental view maintenance is coming to PostgreSQL. For now that feature (from same contributor) is available as extension
https://github.com/sraoss/pg_ivm
ClickHouse updating materialized views in realtime.
How it is done:
1. Intermediate states of aggregate functions are first-class data types in ClickHouse. You can create a data type for the intermediate state of count(distinct), or quantile, whatever, and it will store a serialized state. It can be inserted into a table, read back, merged or finalized, and processed further.
2. Materialized views are simple triggers on INSERT. It enables incremental aggregation in realtime with materialized views.
Downsides:
Limited area of application. Does not respect multi-table queries if updates are done on more than one table.
How does Clickhouse handle downstream aggregations? E.g...
SELECT AVG(col) FROM (SELECT MAX(col2) AS col FROM table GROUP BY col3) t;
type query? Essentially, materialized views which are very unstable? Flink has its retract streams where rows can be semantically removed from the output and downstream query plans can understand deletes, but my expectation has been that these have bad worst case performance.
It does not handle deletes / updates of the source table at all - it means the data of the materialized view become unaffected. Nothing sophisticated...
About the multi-level aggregate queries - the only way is to define a materialized view for the inner query and do calculations on top of it on the fly. So, the data for the inner query will be pre-aggregated, but nothing on top of that.
The biggest difference between automatically keeping an MV up to date vs keeping indices up to date is that the write amplification of the latter is a function of the index count you have, while the former is a function of data and query. It’s easy to come up with cases when users update a single db row, and you end up having to update millions of rows in a MV (eg the every row in the MV has a name “Rob” and Bob changes his name).
I read the timely dataflow which underpins materialize.com. It seems like we don’t necessarily need the support of loops, which timely dataflow allows, for regular SQL, which is a DAG of operators. It appears that as long as the database supports snapshot reads, one can have a push-based query execution to enable incremental MV updates. The problem, I think, is still in the write-amp-as-a-function-of-data, which is unbounded. It is very cool regardless.
The technique can be used for cache invalidation as well, given the data cached needs to be described in SQL, which seems reasonable.
This causes the whole materialized view to update. If the underlying query is complex (which it usually is - otherwise you would not need the view), you often don't want to rerun it for every update.
With the separate tools the idea is to stream the incoming updates and keep the materialized views up to date all the time.
How does it know which rows would be affected how when we have something complex? Even better: Is there a detailed proposal or a paper to allow me to gawk at the intelligence of other people? :)
Admittedly not an application database, but BigQuery also has materialized views that never return stale data
It also rewrites queries to make use of any materialized views you have, if it would make the query more efficient without affecting the results (so covers some use cases which indices would cover)
Another option that I prefer, when you expect to update the data with regular intervals, is to add another column with a timestamp and simply create a new copy of the table contents with the updated data for a new timestamp.
That way you can always go back in time. And storage is often cheap.
If you auto-update your query-derived data at every commit on the underlying tables, isn't that just an index? In a way, true materialized views are defined by the possibility of including stale data.
The challenge with materialization views is how to determine how any given row change should modify the view. For simple queries with a predicate or two, this isn't too hard; as you say, it's similar to an index. But views can contain complex joins, which isn't true for indexes (or I've never seen a database that supports joins in indexes), and which makes them much harder to materialize, since you essentially have to track dependencies between rows. Views can also use LIMIT and ORDER BY; you can have a "top 100 most listened to songs" or whatever, and determining whether a view at any point of create/update/delete should include or not include a specific song is also a challenging problem in this space. Window functions, grouping aggregation, and other SQL features also make materializing a hard problem.
Generally you would use indexes to speed up a complex join, to the point where the overhead of the join itself would just not matter and there would be no point to having an index on it. Many databases also support partial indexes already, which could go a long way towards supporting the "top # rows" use case.
If materialized view doesn't refresh data on insert/update, this basically works similar to a cache.
Caching the front page, server side, instead of rendering it on every request would have given similar result -- my guess. Specially when the query was taking 900ms.
If you only have a single application in front of it, that's a valid approach. If you have multiple applications- maybe a separate reporting tool, for example- you now have the problem of maintaining and invalidating multiple caches.
Much easier to keep the cache located with the data since the cache changes when the data does.
You might also be able to denormalize the categories down to an integer-array column and build a GIN or GiST index on that. I'd probably use triggers to automatically keep the integer-arrays up to date, but there are a number of ways it could be done if you don't like triggers.
in my XP materialized views are just another layer of caching, and caching consistence is, of course, one of the bigger problems in computing.
It's a great tool, but also one that can easily become a source of its own problems. One big issue is that devs stop caring because "the materialized view is fast" and you end up with materialized views that take 20 minutes to generate, and thus are 20 mins out of date the moment they come online. I'm not being hyperbolic.
DB Normalization is good, but sometimes it brings its own problems. Sometimes duplication is "better". Sometimes you should ask yourself "is a document db like MongoDB a better solution for our needs?"
In SQL Server it is part of the transaction that affected the underlying data. So you don’t worry about that. Instead you worry about lock escalations!
Lots of strategies are possible. I may only need to refresh the data in it once a day. I may need to refresh it on every write. It depends on the applications needs.
Projects like https://readyset.io/, https://materialize.com/, https://github.com/mit-pdos/noria can keep your materialized views up to date as the underlying base tables change.