|
|
|
|
|
by wenc
1178 days ago
|
|
I use DuckDB because I can express complex analytic queries better using a SQL mental model. Most software people hate SQL because they can never remember its syntax, but for data scientists, SQL lets us express our thoughts more simply and precisely in a declarative fashion — which gets us query plan optimization as a plus. People have been trying to get rid of SQL for years yet they only end up reinventing it badly. I’ve written a lot of code and the two notations I always gravitate toward are the magrittr + dplyr pipeline notation and SQL. The chained methods notation is a bit too unergonomic especially to express window functions and complex joins. Spark started out with method chaining but eventually found that most people used Spark SQL. |
|
With columnar data DuckDuckGo is somuchfaster at this.
For one of my projects I have what sounds like a dumb workflow: - JSON api fetches get cached in sqlite3 - Parsing the JSON gets done with sqlite3 JSON operators (Fast! Fault tolerant! Handles NULLs nicely! Fast!!). - Collating data later gets queried with duckdb - everything gets munged and aggregated into the shape I want it and is persisted in parquet files - When it's time to consume it duckdb queries my various sources, does my (used to be expensive) groupbys onthefly and spits out pandas data frames - Lastly those data frames are small-ish, tidy and flexible
So yeah, on paper it sounds like these 3 libraries overlap too much to be use at the same time but in practice they can each have their place and interact well.