by mytherin 1692 days ago
DuckDB developer here. DuckDB has a query engine that can directly query external data formats (CSV, Parquet, Arrow, Pandas, etc.) without loading the data first, but it also has its own columnar, ACID-compliant storage format.

It can certainly be used in the same manner as DataFusion, but can also serve as a stand-alone database system. DuckDB aims to have much more comprehensive SQL support beyond only SELECT queries.

1 comment

Thanks for the clarification! :)

Not advice as such, but you should probably consider spinning off a secondary product from DuckDB with a sole focus on "reading data from Parquet files and running aggregations as efficiently as possible". You can probably skip INSERT, UPDATE, and DELETE completely.

There is currently a gap in practical solutions for this pain point. You can use Spark or Airflow, but neither comes without a big infra price tag. You can also do it with pandas, but then you need an instance large enough to load the entire dataset in memory. I think the right product could even outpace what you currently have with DuckDB.