| HN Mirror


	by isjustintime 462 days ago
	This is pretty exciting. DuckDB is already proving to be a powerful tool in the industry. Previously there was a strong trend of using simple S3-backed blob storage with Parquet and Athena for querying data lakes. It felt like things had gotten pretty complicated, but as integrations improve and Apache Iceberg gains maturity, I'm seeing a shift toward greater flexibility with less SaaS/tool sprawl in data lakes.

3 comments

RobinL 462 days ago

Yes - agree! I actually wrote a blog about this just two days ago:

May be of interest to people who want to know:

- What DuckDB is and why it's interesting

- What's good about it

- Why, for orgs without huge data, we will hopefully see a lot more of 's3 + duckdb' rather than more complex architectures and services, and hopefully (IMHO) less Spark!

https://www.robinlinacre.com/recommend_duckdb/

I think most people in data science or data engineering should at least try it to get a sense of what it can do

Really for me, the most important thing is it makes it so much easier to design and test complex ETL because you're not constantly having to run queries against Athena/Spark to check they work - you can do it all locally, in CI, set up tests, etc.

pletnes 462 days ago

I have the same thoughts. However, my impression is also that most orgs would choose e.g. Databricks or something for the permission handling, web UI, ++, so what is the equivalent «full rig» with DuckDB and S3 / blob storage?

RobinL 462 days ago

Yeah I think that's fair, especially from the 'end consumer of the data' point of view, and doing things like row-level permissions.

For the ETL side, where often whole-table access is good enough, I find Spark in particular very cumbersome - there's more that can go wrong vs. DuckDB and it's harder to troubleshoot.

yakshaving_jgt 462 days ago

Funny, I read TFA and came to the comments to share exactly this recent blog post of yours. Big fan of your work, Robin!

RobinL 462 days ago

Ah nice - reading that made me feel good! Appreciate the feedback!

hn1986 462 days ago

from the blog: "This is a very interesting new development, making DuckDB potentially a suitable replacement for lakehouse formats such as Iceberg or Delta lake for medium scale data."

I don't think we'll ever see this, honestly.

excellent podcast episode with Joe Reis - I've also never understood this whole idea of "just use Spark" or you gotta get on Redshift.

teruakohatu 462 days ago

> excellent podcast episode with Joe Reis - I've also never understood this whole idea of "just use Spark" or you gotta get on Redshift.

Can you link to the podcast episode?

hn1986 461 days ago

The episode and transcript are referenced in the blog: https://www.robinlinacre.com/recommend_duckdb/#:~:text=in%20...

mritchie712 462 days ago

if you're looking to try out duckdb + iceberg on AWS, we have a solid guide here: https://www.definite.app/blog/cloud-iceberg-duckdb-aws

raffraffraff 461 days ago

Kinda the same as metrics/logs systems using blob storage? (e.g. Mimir, Loki). Because I remember the hassle of HBase, Cassandra, ELK.
