by RobinL
1572 days ago
This is great. In terms of real-world uses, I'm currently working on enabling DuckDB as a backend in Splink[1], software for record linkage at scale. Central to the software is an iterative algorithm (Expectation Maximisation) that performs a large number of group-by aggregations on large tables. Until recently, it was PySpark only, but we've found DuckDB gives us great performance on medium-sized data. This will be enabled in a forthcoming release (we have an early pre-release demo of the DuckDB backend[2]). This new DuckDB backend will probably be fast enough for the majority of our users, who don't have massive datasets. With this in mind, I'm excited to hear that:
> Another large area of future work is to make our aggregate hash table work with out-of-core operations, where an individual hash table no longer fits in memory, this is particularly problematic when merging.

This would be an amazing addition. Our users typically need to process sensitive data, and spinning up Spark can be a challenge from an infrastructure perspective. I'm imagining as we go forwards, more and more will be possible on a single beefy machine that is easily spun up in the cloud. Anyway, really just wanted to say thanks to the DuckDB team for great work - you're enabling a lot of value downstream!

[1] https://github.com/moj-analytical-services/splink
[2] https://github.com/moj-analytical-services/splink_demos/tree...
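
For anyone wondering what "EM as a large number of group-by aggregations" can look like, here is a rough sketch. It is not Splink's code. It is a toy two-class model with two hypothetical comparison columns (`g_name`, `g_dob`, where 1 means the values agree), made-up pair counts, and arbitrary starting parameters. It also uses Python's built-in `sqlite3` as a stand-in for DuckDB, purely so it runs with no extra installs. The E-step scores every pair in SQL, and the M-step is one `GROUP BY` per comparison column.

```python
import sqlite3

COLS = ["name", "dob"]  # hypothetical comparison columns (1 = values agree)


def make_toy_pairs():
    """In-memory table of made-up record pairs, one row per candidate pair."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE pairs (g_name INTEGER, g_dob INTEGER, p REAL)")
    patterns = [(1, 1)] * 90 + [(1, 0)] * 30 + [(0, 1)] * 30 + [(0, 0)] * 850
    con.executemany("INSERT INTO pairs (g_name, g_dob) VALUES (?, ?)", patterns)
    return con


def likelihood_sql(prefix):
    """SQL for the product over columns of P(observed flag | class)."""
    return " * ".join(
        f"(CASE WHEN g_{c} = 1 THEN :{prefix}_{c} ELSE 1 - :{prefix}_{c} END)"
        for c in COLS
    )


def em_step(con, lam, m, u):
    """One EM iteration.

    lam is the assumed share of matching pairs; m[c] and u[c] are the
    probabilities that column c agrees given a match / a non-match.
    """
    params = {"lam": lam}
    for c in COLS:
        params[f"m_{c}"] = m[c]
        params[f"u_{c}"] = u[c]
    match, non_match = likelihood_sql("m"), likelihood_sql("u")

    # E-step: match probability for every pair under the current parameters.
    con.execute(
        f"UPDATE pairs SET p = (:lam * {match}) / "
        f"(:lam * {match} + (1 - :lam) * {non_match})",
        params,
    )

    # M-step: one group-by aggregation per comparison column.
    new_m, new_u = {}, {}
    for c in COLS:
        rows = con.execute(
            f"SELECT g_{c}, SUM(p), SUM(1 - p) FROM pairs GROUP BY g_{c}"
        ).fetchall()
        match_weight = {flag: sp for flag, sp, _ in rows}
        non_match_weight = {flag: sq for flag, _, sq in rows}
        new_m[c] = match_weight.get(1, 0.0) / sum(match_weight.values())
        new_u[c] = non_match_weight.get(1, 0.0) / sum(non_match_weight.values())

    new_lam = con.execute("SELECT AVG(p) FROM pairs").fetchone()[0]
    return new_lam, new_m, new_u


def run_em(iterations=20):
    """Run EM on the toy table from arbitrary starting values."""
    con = make_toy_pairs()
    lam, m, u = 0.2, {c: 0.9 for c in COLS}, {c: 0.1 for c in COLS}
    for _ in range(iterations):
        lam, m, u = em_step(con, lam, m, u)
    return lam, m, u


if __name__ == "__main__":
    print(run_em())
```

Every iteration rescans the whole pairs table and aggregates it once per column, which is why the speed of the engine's group-by matters so much for this kind of workload.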
|
My DuckDB wrapper, which I sent you in the GitHub issue a few weeks ago, linked a pair of five-million-record datasets in about twenty minutes. Spark took about three hours to do the same job on a cluster with effectively unlimited resources.