Hacker News new | ask | show | jobs
by mildbyte 739 days ago
Another difference is that this solution uses parquet_fdw, which handles fast scans through Parquet files and filter pushdown via row group pruning, but doesn't vectorize the groupby / join operations above the table scan in the query tree (so you're still using the row-by-row PG query executor in the end).

pg_analytics uses DataFusion (dedicated analytical query engine) to run the entire query, which can achieve orders of magnitude speedups over vanilla PG with indexes on analytical benchmarks like TPC-H. We use the same approach at EDB for our Postgres Lakehouse (I'm part of the team that works on it).

1 comments

This is definitely something I intend to fix.

My initial intent was to use duckdb for fast vectored query execution but I wasn't able to create a planner / execution hook that uses duckdb internally. Will definitely checkout pg_analytics / Datafusion to see if the same can be integrated here as well. Thanks for the pointers.

Have you seen duckdb_fdw (https://github.com/alitrack/duckdb_fdw)? IIRC it's built based on sqlite_fdw, but points the outbound queries to DuckDB instead of SQLite, and it does handle running aggregations inside of DuckDB. Could be useful.
This is great, thank so much! I'll see if I can I can integrate this and how it compares to parquet_fdw.