Hacker News new | ask | show | jobs
by minimaxir 596 days ago
The test case of a simple aggregation is a good example of an important data science skill knowing when and here to use a given tool, and that there is no one right answer for all cases. Although it's worth noting that DuckDB and polars are comparable performance-wise for aggregation (DuckDB slightly faster: https://duckdblabs.github.io/db-benchmark/ ).

For my cases with polars and function piping, certain aspects of that workflow are hard to represent in SQL, and additionally it's easier for iteration/testing on a given aggregation to add/remove a given function pipe, and to relate to existing tables (e.g. filter a table to only IDs present in a different table, which is more algorithmically efficient than a join-then-filter). To do the ETL I tend to do for my data science workin pandas/polars in SQL/DuckDB, it would require chains of CTEs or other shenanigans, which eliminates similicity and efficincy.

1 comments

The real winner is going to be a framework that, during dev, transparently materializes CTEs to temporary tables so you can iterate on them like you’re saying, while continuing to harness SQL for the end product.
Perhaps not exactly what you're talking about, but maybe? (unsure bc the with statements are sometimes called "temp tables")

https://duckdb.org/docs/sql/query_syntax/with#cte-materializ...

Obviously, the materialization is gone after the query has ended, but still a very powerful and useful directive to add to some queries.

There are also a few DuckDB extensions for pipeline SQL languages.

https://duckdb.org/community_extensions/extensions/prql.html

https://duckdb.org/community_extensions/extensions/psql.html

And of course dbt-duckdb https://github.com/duckdb/dbt-duckdb

Do dbt or SQLMesh do this, or if not can you say more about what you’re envisioning?