Materialized View: SQL Queries on Steroids

Y	Hacker News new \| ask \| show \| jobs

	Materialized View: SQL Queries on Steroids (dinesh.wiki)
	71 points by dineshgowda24 1308 days ago

8 comments

tantaman 1306 days ago

> Application can tolerate stale data...

Projects like https://readyset.io/, https://materialize.com/, https://github.com/mit-pdos/noria can keep your materialized views up to date as the underlying base tables change.

link

mirzap 1306 days ago

These look interesting. It's worth mentioning that incremental view maintenance is coming to PostgreSQL. For now that feature (from same contributor) is available as extension https://github.com/sraoss/pg_ivm

link

garyclarke27 1306 days ago

Hope it comes soon, this would be the most important Postgres enhancement for many years.

link

zX41ZdbW 1306 days ago

ClickHouse updating materialized views in realtime.

How it is done:

1. Intermediate states of aggregate functions are first-class data types in ClickHouse. You can create a data type for the intermediate state of count(distinct), or quantile, whatever, and it will store a serialized state. It can be inserted into a table, read back, merged or finalized, and processed further.

2. Materialized views are simple triggers on INSERT. It enables incremental aggregation in realtime with materialized views.

Downsides:

Limited area of application. Does not respect multi-table queries if updates are done on more than one table.

Disclaimer: I'm developer of ClickHouse: https://github.com/ClickHouse/ClickHouse

link

CHY872 1306 days ago

How does Clickhouse handle downstream aggregations? E.g...

  SELECT AVG(col) FROM (SELECT MAX(col2) AS col FROM table GROUP BY col3) t;

type query? Essentially, materialized views which are very unstable? Flink has its retract streams where rows can be semantically removed from the output and downstream query plans can understand deletes, but my expectation has been that these have bad worst case performance.

link

zX41ZdbW 1306 days ago

It does not handle deletes / updates of the source table at all - it means the data of the materialized view become unaffected. Nothing sophisticated...

About the multi-level aggregate queries - the only way is to define a materialized view for the inner query and do calculations on top of it on the fly. So, the data for the inner query will be pre-aggregated, but nothing on top of that.

link

uvdn7 1306 days ago

The biggest difference between automatically keeping an MV up to date vs keeping indices up to date is that the write amplification of the latter is a function of the index count you have, while the former is a function of data and query. It’s easy to come up with cases when users update a single db row, and you end up having to update millions of rows in a MV (eg the every row in the MV has a name “Rob” and Bob changes his name).

I read the timely dataflow which underpins materialize.com. It seems like we don’t necessarily need the support of loops, which timely dataflow allows, for regular SQL, which is a DAG of operators. It appears that as long as the database supports snapshot reads, one can have a push-based query execution to enable incremental MV updates. The problem, I think, is still in the write-amp-as-a-function-of-data, which is unbounded. It is very cool regardless.

The technique can be used for cache invalidation as well, given the data cached needs to be described in SQL, which seems reasonable.

link

Bishonen88 1306 days ago

No third party tools etc. are needed for that. A trigger can update a materialized view whenever the source table changes.

link

jpalomaki 1306 days ago

This causes the whole materialized view to update. If the underlying query is complex (which it usually is - otherwise you would not need the view), you often don't want to rerun it for every update.

With the separate tools the idea is to stream the incoming updates and keep the materialized views up to date all the time.

link

egeozcan 1306 days ago

How does it know which rows would be affected how when we have something complex? Even better: Is there a detailed proposal or a paper to allow me to gawk at the intelligence of other people? :)

link

tazjin 1306 days ago

Yes, there is!

https://jon.thesquareplanet.com/papers/phd-thesis.pdf

This is the paper on which Readyset and Noria are based.

link

dineshgowda24 1303 days ago

I am unable to open the link. Do you mind sharing a working one? Thanks.

link

BillinghamJ 1306 days ago

Admittedly not an application database, but BigQuery also has materialized views that never return stale data

It also rewrites queries to make use of any materialized views you have, if it would make the query more efficient without affecting the results (so covers some use cases which indices would cover)

https://cloud.google.com/bigquery/docs/materialized-views-in...

link

beckingz 1304 days ago

These are unfortunately limited to a tiny subset of SQL.

link

fifilura 1306 days ago

Another option that I prefer, when you expect to update the data with regular intervals, is to add another column with a timestamp and simply create a new copy of the table contents with the updated data for a new timestamp.

That way you can always go back in time. And storage is often cheap.

link

zozbot234 1306 days ago

If you auto-update your query-derived data at every commit on the underlying tables, isn't that just an index? In a way, true materialized views are defined by the possibility of including stale data.

link

atombender 1306 days ago

The challenge with materialization views is how to determine how any given row change should modify the view. For simple queries with a predicate or two, this isn't too hard; as you say, it's similar to an index. But views can contain complex joins, which isn't true for indexes (or I've never seen a database that supports joins in indexes), and which makes them much harder to materialize, since you essentially have to track dependencies between rows. Views can also use LIMIT and ORDER BY; you can have a "top 100 most listened to songs" or whatever, and determining whether a view at any point of create/update/delete should include or not include a specific song is also a challenging problem in this space. Window functions, grouping aggregation, and other SQL features also make materializing a hard problem.

link

zozbot234 1306 days ago

Generally you would use indexes to speed up a complex join, to the point where the overhead of the join itself would just not matter and there would be no point to having an index on it. Many databases also support partial indexes already, which could go a long way towards supporting the "top # rows" use case.

link

kogir 1306 days ago

Materialized views are a much more powerful tool when used on databases that properly support them.

When I’ve used them the database has transparently handled updating them efficiently as the source data changes.

link

hardwaresofton 1306 days ago

There’s awesome work being done on incremental view maintenance in postgres:

https://github.com/sraoss/pg_ivm

link

dineshgowda24 1303 days ago

TIL, this is amazing. Thanks for the link

link

habibur 1306 days ago

If materialized view doesn't refresh data on insert/update, this basically works similar to a cache.

Caching the front page, server side, instead of rendering it on every request would have given similar result -- my guess. Specially when the query was taking 900ms.

link

zdragnar 1306 days ago

If you only have a single application in front of it, that's a valid approach. If you have multiple applications- maybe a separate reporting tool, for example- you now have the problem of maintaining and invalidating multiple caches.

Much easier to keep the cache located with the data since the cache changes when the data does.

link

mamcx 1306 days ago

For when you need real-time updates, triggers + denormalized table is easy and supported everywhere.

One of my apps in fact use that table as primary (read) for most things and post the changes to the base tables. Was far faster than the normal way.

link

ris 1306 days ago

This is one way to do it.

You might also be able to denormalize the categories down to an integer-array column and build a GIN or GiST index on that. I'd probably use triggers to automatically keep the integer-arrays up to date, but there are a number of ways it could be done if you don't like triggers.

link

masukomi 1306 days ago

in my XP materialized views are just another layer of caching, and caching consistence is, of course, one of the bigger problems in computing.

It's a great tool, but also one that can easily become a source of its own problems. One big issue is that devs stop caring because "the materialized view is fast" and you end up with materialized views that take 20 minutes to generate, and thus are 20 mins out of date the moment they come online. I'm not being hyperbolic.

DB Normalization is good, but sometimes it brings its own problems. Sometimes duplication is "better". Sometimes you should ask yourself "is a document db like MongoDB a better solution for our needs?"

link

canada2020 1306 days ago

But when do you refresh the materialized views?

link

quickthrower2 1306 days ago

In SQL Server it is part of the transaction that affected the underlying data. So you don’t worry about that. Instead you worry about lock escalations!

link

yCombLinks 1306 days ago

Lots of strategies are possible. I may only need to refresh the data in it once a day. I may need to refresh it on every write. It depends on the applications needs.

link

canada2020 1306 days ago

I see. Given it’s so flexible, is this pretty much like a caching layer?

link

dineshgowda24 1306 days ago

For us, real-time refreshes weren't necessary. So we ended up refreshing it every 12 hours through application crons.

link

smitty1e 1306 days ago

Presumably when underlying data are updated.

link