Hacker News new | ask | show | jobs
by llampx 1986 days ago
Question for the pros: In doing some data engineering work, I found that creating temporary tables and dropping them after the run was much more performant and memory-efficient than using CTEs. No other change was made to the queries in the CTE, just putting them in a separate CREATE TABLE AS... script before the part that needed the calculations.

Why is this the case? Shouldn't CTEs be more efficient?

3 comments

Until postgres 12 (if you were using postgres) CTEs were an optimization fence. Where filters would not get pushed down into the CTE if they were only specified outside the CTE (but in fields from the CTE).

https://paquier.xyz/postgresql-2/postgres-12-with-materializ...

Well, filters won't get pushed down to previous CREATE TABLE either.
You already got a good response for postgres.

If you where speaking of MSSQL, it does treats non-recursive[1] CTE's as equivalent to to a view, and effectively inline it. So if things were faster with a temp table than a CTE it would mean that the query optimizer came up with a bad query plan. The most likely reason being that your query was complicated enough that you confused the cardinality estimator, so it ends up choosing really bad join types.

[1] Recursive queries in MSSQL are more complicated. They spool, but not necessarily all rows get spooled right away. So an infinite result recursive CTE does not necessarily fail. In any case predicate pushdown still occurs.

Really depends on query complexity. Keep in mind that creating a CTE doesn't really materialize any of the resulting steps' data, it's just a way to encapsulate logic. Chaining together complex CTEs is effectively the same as creating layers of views, which can also be a performance problem.

In the case of a temp table, you're likely intelligently (manually) going through the process of whittling down the data to only what you need and in the most efficient ways possible. Maybe you're filtering by date upfront, then by some other filter, removing as many rows as is possible as quickly as possible. With the CTE(s), it might be creating a list of many filters with a lot of joins in one step, and the optimizer may not understand which filters can remove the most rows the most easily.

> creating a CTE doesn't really materialize any of the resulting steps' data

Well, at least with PostgreSQL -- all CTEs were always materialized. It was optimized only in the version 12.