Hacker News new | ask | show | jobs
by code_biologist 2351 days ago
Overall, I love this approach, but the big pain for me is the lack of incremental view refreshes. I end up needing to recreate giant tables each refresh even though the underlying changes are small. Sure I could implement that myself, but that sacrifices the correctness guarantees that materialized views provide!

Two things that would help:

1. Getting Incremental View Maintenance (IVM) [1] into Postgres. It looks like work is beginning on this but it's been 7+ years coming. If there are any Postgres devs looking at this, I'm cheering for you!

2. There's a commercial group doing an implementation of this, Materialize [2], but they don't have any publicly released product yet. I mention it because their interview on Data Engineering Podcast is really good [3] and I really like their focus on correctness by working from replication logs.

[1] https://wiki.postgresql.org/wiki/Incremental_View_Maintenanc...

[2] https://materialize.io/

[3] https://www.dataengineeringpodcast.com/materialize-streaming...

4 comments

Likely too boring and old school data warehousing, but this is what the notion of rollup/aggregate tables achieve. You define a table that's the result of some sort of an aggregation query (time is almost always involved: minutely, hourly, daily, monthly plus other business-specific dimensions) and at the conclusion of each corresponding interval you insert the next respective batch.

    e.g. insert into rollup_clicks_hourly select hour, count(*) clicks from clicks where created_at >= ? and created_at < ? group by 1
Then from a reporting perspective, users/dashboards/etc read from the highest relevant table that meets their criteria.
I worked with a company who were doing (basic)analytics on their live ERP db, sales and margin reporting etc. There was a view written to join in all of the common tables so the in house people could query it easily, things like product names, customer names and addresses etc. When performance fell off they materialised the view as a stop gap... until they got around to a data warehouse. 10 years later it is still there, building each night! It was actually the perfect solution for them, simple analytics with no more infrastructure!
That's a very clever concept.

Could you recommend any classic old-school books about data warehousing that you'vre read that teach more such techniques?

Thank you!

Yup, exactly. Kimball's stuff is the best. You can achieve on a modest machine what modern techniques would require an incredibly expensive horizontally scaled MPP database. It does require a lot more planning and forethought to be certain.

https://www.amazon.com/dp/product/1118530802 and possibly https://www.amazon.com/dp/0764567578/

You can do effectively the same thing using a physical table and triggers [1] if you want a pure postgres solution. I ended up with the same problem and just ended up using a physical table and implemented the updating logic at our app layer. Not exactly fun or easy to implement all the updating logic yourself though...

[1] https://hashrocket.com/blog/posts/materialized-view-strategi...

I'm very hopeful for approaches that model around this need from the ground up. Big Data distributed stream processing frameworks like Spark or Flink allow for standing SQL queries that are incrementally build; and you can even subscribe to react to the results.

There are other initiatives for plugging it on existing RDBMSs, like Noria [1] for MySQL. It allows for subscribing to changes; and also for lazy evaluation of the materializing view rows, keeping them in cache for later.

[1] https://github.com/mit-pdos/noria/blob/master/README.md

Like you said, not easy: Recreating the functionality of views using triggers is error prone and a heavy maintenance burden, and the triggers have to be implemented differently based on what each query is like... In a lot of cases it's better to just optimize the query and indexing.

In PG you can make a lot of queries run straight from one index if you tailor an index to a query, since you can index your own plpgsql functions.

Materialized views are fundamentally just some syntactical sugar over the results of a query cached in a table.

The syntactical sugar does have advantages, but comes at a cost of flexibility - if you need more control over the update process or modification of the data than a full refresh, then you should just use a table (create table whatever as select * from ...).

It also seems odd to me that the refresh is scheduled in the app instead of the database.

Can anybody here speak to that choice?

Yeah. Coming from a Django world it's about avoiding surprises for application programmers. I suspect it's the same in Rails. Historically Django has almost entirely had logic at the application level rather than pushing it into the DB: scheduled jobs, data validation, trigger-ish logic (via `post_save` signals and such).

Part of the power of OP's technique is that it looks exactly like a standard Django/Rails data model with a tiny sprinkle of magic -- no surprises. It's surprising for a Django programmer to hear "all periodic tasks are handled via celery, except these table refreshes" or "signals are responsible all work in response to model changes, except these triggers".

Obviously in some cases you need a trigger for correctness, but in general I try to stick to the conventions of the ecosystem.

Do you mean using something like `pg_cron` instead of calling it through Scenic and a cron fired at the app level?

I personally don't like having my cron distributed in different areas. All my crons are in one place/system like easycron.com or setcronjob.com or whenever. Performance is not a consideration when deciding to run the job directly from the DB or the App.

I answered the question sentence directly, but maybe that's not what you were asking?

Yeah, pg_cron, then. Updating a materialized view feels like a data integrity issue which should be addressed within the DB itself, if not the DDL for the view itself.