Hacker News new | ask | show | jobs
by smt88 1226 days ago
This is an interesting niche. Can anyone explain what they're using it for currently?

Much like Redis, I admire the technology but can't think of a project I've worked on that would benefit from it.

Is it for games, maybe? Desktop or mobile apps?

5 comments

Quick, in-core data transformation. If you want to transform some data right now, one option is writing pyspark and running that on a spark cluster. But noone really has big big data, there are relatively few cases where you have multi TB datasets, warranting the complexities of running the analyics in a distributed way.

DuckDB lets you process all that locally. It's the OLAP equivalent to SQLite's OLTP.

If I wasn't so beholden to the vagaries and inefficiencies of C-level endorsed enterprise software, I'd immediately be trying this out for data transformations/pipelines. I think that one big box (200+ gb ram, couple of cores and fat IO/network) runs circles around an entire spark cluster.

Interesting. I need to think about this one a little bit. Thank you.

Is there a reason "in-core" is a specific requirement here?

Not really, and duckdb doesn't need to hold everything in RAM as i recalll. But it's fast, far faster than several read-process-write steps can be, especially when coordinated over multiple machines

(By the way, maybe I was vague, using overloaded terminology. To be precise with 'in-core' i meant that the solution to an analytic query is held completely in memory, not that it's restricted to using one cpu thread.)

I'm interested in this, too.

I can totally see how not having to manage a standalone RDBMS makes sense. But, what's the real-world advantage over something like SQLite?

I mean, the idea of an in-memory relational engine for things like games or embedded totally makes sense, but this seems to target large datasets and deep analysis.

As far as I understand with this model you pretty much re-ingest data from the "raw" source on startup every time. Is this correct?

Judging by the rise on interest I'm sure there's an obvious use case I'm not seeing either.

> But, what's the real-world advantage over something like SQLite?

This very specific question is what I'm trying to understand. SQLite can be run in memory and as a temporary store.

Duckdb excels at OLAP workloads that SQLite would choke on
think BI tools, analytics dashboards for exploratory analysis, or even just exploratory analysis on the terminal with it's rich query capabilities. you can keep analytics data in SQLite, but DuckDB will process it faster/easier for the analytics use cases.
> think BI tools, analytics dashboards for exploratory analysis, or even just exploratory analysis on the terminal with it's rich query capabilities

I thought about that, but I'd never use DuckDB for it because DuckDB is locked into a single process. I can't figure out a benefit of being suck with one core when I always have between 2 and 32 available to me.

DuckDB uses all of your cores! It just uses threads, not processes!
We're using it to migrate data pipelines in AWS which were previously run using Glue to Lambda with duckdb. Glue was too heavyweight, slow and expensive for our GB data volumes. We consume csv files use a lambda and duckdb to convert them to parquet. Then another lambda to load these parquet files and do our transformation logic (deduplications, enrichments, clean up, etc) and writing out to parquet files.
Hmm, interesting... so basically DuckDB works in this case because there's no way to parallelize the migration of a single volume anyway?

This is definitely a pretty niche case, though, so there must be something more general that this was built to do.

From what I can gather, it's in-process, so less of a hassle when installing things or doing system administration. Also I suppose easier to port, and e.g. run in a browser. I guess any speed improvements wrt a DB running in a separate process will be tiny and irrelevant for most applications.
> easier to port, and e.g. run in a browser

This has to be the main point, right? DuckDB isn't the first mover here (SQL.js, which is SQLite compiled to WASM using emscripten, seems to work fine), but perhaps DuckDB is better as a purpose-built solution.

Might be wrong, but it looks like duckdb lets you host the database engine in your process, so you don't pay for IPC. It is the opposite of Redis, as you use it to share memory between processes
I'm confused by your sentence structure. Are you saying DuckDB lets you share memory between processes?

If so, that's the opposite of what DuckDB does. Under to "When not to use DuckDB" section of their website, they say:

> "[Do not use DuckDB when] writing to a single database from multiple concurrent processes"

Honestly, that's the most baffling part. I can't imagine wanting any database that's locked in a single process.