

by isoprophlex 1226 days ago

Quick, in-core data transformation. If you want to transform some data right now, one option is writing PySpark and running that on a Spark cluster. But no one really has big, big data; there are relatively few cases where you have multi-TB datasets that warrant the complexity of running the analytics in a distributed way.

DuckDB lets you process all that locally. It's the OLAP equivalent to SQLite's OLTP.

If I weren't so beholden to the vagaries and inefficiencies of C-level-endorsed enterprise software, I'd immediately be trying this out for data transformations/pipelines. I think that one big box (200+ GB RAM, a couple of cores, and fat IO/network) runs circles around an entire Spark cluster.


smt88 1226 days ago

Interesting. I need to think about this one a little bit. Thank you.

Is there a reason "in-core" is a specific requirement here?


isoprophlex 1226 days ago

Not really, and DuckDB doesn't need to hold everything in RAM, as I recall. But it's fast, far faster than several read-process-write steps can be, especially when those are coordinated over multiple machines.

(By the way, maybe I was vague, using overloaded terminology. To be precise: by 'in-core' I meant that the solution to an analytic query is held completely in memory, not that it's restricted to using one CPU thread.)
