Hacker News new | ask | show | jobs
by eis 1506 days ago
I was just yesterday exploring DuckDB and it looked very promising but I was very surprised to find out that indexes are not persisted (and I assume that means they must fit in RAM).

> Unique and primary key indexes are rebuilt upon startup, while user-defined indexes are discarded.

The second part with just discarding previously defined indexes is super surprising.

https://duckdb.org/docs/sql/indexes

This was an instant showstopper for me or I assume most people whose databases grow to a bigger size at which point an OLAP DB becomes interesting in the first place.

Also the numerous issues in Github regarding crashes make me hesitant.

But I really like the core idea of DuckDB being a very simple codebase with no dependencies and still providing very good performance. I guess I just would like to see more SQLite-esque stability/robustness in the future and I'll surely revisit it at some point.

2 comments

Hey eis, you are correct. We do not support index storage in our latest release. I am currently implementing this, and it is in a fairly advanced stage. So it should be featured in the next release.

This took a little while because we use a fairly modern index structure with no literature definition on how to buffer manage it.

It's good to hear that persistent indexes are coming soon. I saw it was on the roadmap but didn't know how far out this feature was. Do you have an idea when that release could be out?

BTW Do you have some kind of code/docs one can take a look at regarding the index structure? I'm a part-time data structure nerd :)

Here is a paper that was used when building the DuckDB approach!

https://db.in.tum.de/~leis/papers/ART.pdf

There are some other papers and details about the architecture here: https://duckdb.org/why_duckdb

Will the persistent indices still be built on ART indexes? ART is a great choice of data structure. Fast lookups, range queries and prefix queries.

Typesense DB (Typesense.org) under the hood also uses ART as the base datastructure for fast full text search queries.

So I assume it would be straightforward to build an text search engine on top of duck DB and utilize ART indices to do even more lovely things.

Persistent indexes are being actively worked on! Stay tuned. As for the crashes - DuckDB is very well tested and used in production in many places. The core functionality is very mature! Let us know if you test it out! Happy to help if I can.

(disclaimer - on the DuckDB team)

Hi, good to hear that you guys care about testing. One thing apart from the Github issues that led me to believe it might not be super stable yet was the benchmark results on https://h2oai.github.io/db-benchmark/ which make it look like it couldn't handle the 50GB case due to a out of memory error. I see that the benchmark and the used versions are about a year old so maybe things changed a lot since then. Can you chime in regarding the current story of running bigger DBs like 1TB on a machine with just 32GB or so RAM? Especially regardung data mutations and DDL queries. Thanks!
Yes, that benchmark result is quite old in Duck years! :-)

We actually run that benchmark as a part of our test suite now, so I am certain that there is improvement from that version.

The biggest DuckDB I've used so far was about 400 GB on a machine with ~250 GB of RAM.

There is ongoing work that we are treating as a high priority for handling larger-than-memory intermediate results within a query. But we can handle larger than RAM in many cases already - we sometimes run into issues today if you are joining 2 larger than RAM tables together (depending on the join), or if you are aggregating a larger than RAM table with really high cardinality in one of the columns you are grouping on.

Would you be open to testing out your use case and letting us know how it goes? We always appreciate more test cases!