Hacker News new | ask | show | jobs
by mkaic 1203 days ago
I'm currently operating a very small (10s of millions of rows, ~20GB of total data) low-write MySQL DB with a couple different tables. I'm new to RDBs in general and am using MySQL because my thought was any "real" DB would be better than our previous "pipeline", which was just doing all our data filtering/merging with CSVs and Pandas in Python (extremely slowly, and frustrating).

I like the simplicity of DuckDB's proposal, but haven't seen much info about how fast to expect it to be in comparison with traditional RDBs, for smaller, mostly-read-only applications.

3 comments

Scanning through a CSV can be quite close to querying a SQL database in performance when the SQL database doesn't have any indices. The primary benefits of using a SQL database for querying are (1) indices and (2) a declarative query language. Using DuckDB or SQLite's CSV/JSON support gets you the best of both worlds (minus indices), where you get the declarative query language and query planner but your data's still just CSV/JSON files.

For a dataset that size, I'd probably use SQLite to avoid having to manage a persistent MySQL process, especially when it's being used as an alternative to CSV files. That is, unless there's a MySQL/Postgres server already running I can just create a new database on.

> Using DuckDB or SQLite's CSV/JSON support gets you the best of both worlds (minus indices)

DuckDB automatically creates indexes for all general-purpose columns. However, they're not persisted.

https://duckdb.org/docs/sql/indexes.html

DuckDB is "column oriented" vs "row oriented". I have found it 10x* faster for queries on data your size compared to SQLite or MySQL or Postgres. The added advantage of it being a single file is very nice as well.

*I use the HoneySQL (clojure) library to programmatically build up queries and execute them via the JDBC driver.

Perhaps have a look at this article [1]

[1] https://www.vantage.sh/blog/querying-aws-cost-data-duckdb