Hacker News
by isjustintime 462 days ago
This is pretty exciting. DuckDB is already proving to be a powerful tool in the industry.

Previously there was a strong trend of using simple S3-backed blob storage with Parquet and Athena for querying data lakes. It felt like things had gotten pretty complicated, but as integrations improve and Apache Iceberg gains maturity, I'm seeing a shift toward greater flexibility with less SaaS/tool sprawl in data lakes.

3 comments

Yes - agree! I actually wrote a blog about this just two days ago:

May be of interest to people who:

- Want to know what DuckDB is and why it's interesting

- What's good about it

- Why for orgs without huge data, we will hopefully see a lot more of 's3 + duckdb' rather than more complex architectures and services, and hopefully (IMHO) less Spark!

https://www.robinlinacre.com/recommend_duckdb/

I think most people in data science or data engineering should at least try it to get a sense of what it can do

Really for me, the most important thing is it makes it so much easier to design and test complex ETL because you're not constantly having to run queries against Athena/Spark to check they work - you can do it all locally, in CI, set up tests, etc.

I have the same thoughts. However, my impression is also that most orgs would choose e.g. Databricks or something similar for the permission handling, web UI, and so on. So what is the equivalent «full rig» with DuckDB and S3 / blob storage?
Yeah I think that's fair, especially from the 'end consumer of the data' point of view, and doing things like row-level permissions.

For the ETL side, where often whole-table access is good enough, I find Spark in particular very cumbersome - there's more that can go wrong vs. DuckDB and it's harder to troubleshoot.

Funny, I read TFA and came to the comments to share exactly this recent blog post of yours. Big fan of your work, Robin!
Ah nice - reading that made me feel good! Appreciate the feedback!
from the blog: "This is a very interesting new development, making DuckDB potentially a suitable replacement for lakehouse formats such as Iceberg or Delta lake for medium scale data."

I don't think we'll ever see this, honestly.

excellent podcast episode with Joe Reis - I've also never understood this whole idea of "just use Spark" or you gotta get on Redshift.

> excellent podcast episode with Joe Reis - I've also never understood this whole idea of "just use Spark" or you gotta get on Redshift.

Can you link to the podcast episode?

The episode and transcript are referenced in the blog: https://www.robinlinacre.com/recommend_duckdb/#:~:text=in%20...
if you're looking to try out duckdb + iceberg on AWS, we have a solid guide here: https://www.definite.app/blog/cloud-iceberg-duckdb-aws
Kinda the same as metrics/logs systems using blob storage? (Eg Mimir, Loki). Because I remember the hassle of hbase, Cassandra, ELK.