| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by llampx 2033 days ago
	Question for the pros: In doing some data engineering work, I found that creating temporary tables and dropping them after the run was much more performant and memory-efficient than using CTEs. No other change was made to the queries in the CTE, just putting them in a separate CREATE TABLE AS... script before the part that needed the calculations. Why is this the case? Shouldn't CTEs be more efficient?

3 comments

eatonphil 2033 days ago

Until postgres 12 (if you were using postgres) CTEs were an optimization fence. Where filters would not get pushed down into the CTE if they were only specified outside the CTE (but in fields from the CTE).

https://paquier.xyz/postgresql-2/postgres-12-with-materializ...

link

deepsun 2033 days ago

Well, filters won't get pushed down to previous CREATE TABLE either.

link

jsmith45 2033 days ago

You already got a good response for postgres.

If you where speaking of MSSQL, it does treats non-recursive[1] CTE's as equivalent to to a view, and effectively inline it. So if things were faster with a temp table than a CTE it would mean that the query optimizer came up with a bad query plan. The most likely reason being that your query was complicated enough that you confused the cardinality estimator, so it ends up choosing really bad join types.

[1] Recursive queries in MSSQL are more complicated. They spool, but not necessarily all rows get spooled right away. So an infinite result recursive CTE does not necessarily fail. In any case predicate pushdown still occurs.

link

MSM 2033 days ago

Really depends on query complexity. Keep in mind that creating a CTE doesn't really materialize any of the resulting steps' data, it's just a way to encapsulate logic. Chaining together complex CTEs is effectively the same as creating layers of views, which can also be a performance problem.

In the case of a temp table, you're likely intelligently (manually) going through the process of whittling down the data to only what you need and in the most efficient ways possible. Maybe you're filtering by date upfront, then by some other filter, removing as many rows as is possible as quickly as possible. With the CTE(s), it might be creating a list of many filters with a lot of joins in one step, and the optimizer may not understand which filters can remove the most rows the most easily.

link

deepsun 2033 days ago

> creating a CTE doesn't really materialize any of the resulting steps' data

Well, at least with PostgreSQL -- all CTEs were always materialized. It was optimized only in the version 12.

link