Hacker News new | ask | show | jobs
by wenc 1178 days ago
I use DuckDB because I can express complex analytic queries better using a SQL mental model. Most software people hate SQL because they can never remember its syntax, but for data scientists, SQL lets us express our thoughts more simply and precisely in a declarative fashion — which gets us query plan optimization as a plus.

People have been trying to get rid of SQL for years yet they only end up reinventing it badly.

I’ve written a lot of code and the two notations I always gravitate toward are the magrittr + dplyr pipeline notation and SQL.

The chained methods notation is a bit too unergonomic especially to express window functions and complex joins.

Spark started out with method chaining but eventually found that most people used Spark SQL.

1 comments

In addition to this here's one really specific case: ever had a pandas groupby().apply() that took forever often mostly re-aggregating after the apply?

With columnar data DuckDuckGo is somuchfaster at this.

For one of my projects I have what sounds like a dumb workflow: - JSON api fetches get cached in sqlite3 - Parsing the JSON gets done with sqlite3 JSON operators (Fast! Fault tolerant! Handles NULLs nicely! Fast!!). - Collating data later gets queried with duckdb - everything gets munged and aggregated into the shape I want it and is persisted in parquet files - When it's time to consume it duckdb queries my various sources, does my (used to be expensive) groupbys onthefly and spits out pandas data frames - Lastly those data frames are small-ish, tidy and flexible

So yeah, on paper it sounds like these 3 libraries overlap too much to be use at the same time but in practice they can each have their place and interact well.