by benjaminwootton 1679 days ago
I've been following this and it’s kind of embarrassing to watch.

I love working with Databricks and Snowflake. They both knock it out of the park for their respective use cases. They’re amazing products.

It makes no sense to fall out about this though.

For a 100TB dataset with a funky calculation, Spark will trounce Snowflake. For a one-row dataset, Snowflake will return before the Spark job has been serialised.

2 comments

What are you talking about? Spark isn't even used, and TPC-DS is not a funky calculation at all. It's supposed to be a collection of typical data-warehouse-type queries. I'm not really sure what "funky" means, but why would Spark trounce Snowflake on a "funky" calculation at all? Do you mean an ML algorithm, and are you implying that TPC-DS has anything close to an ML algorithm? And why would Snowflake perform better at returning one row? Its storage is columnar.
Why would Spark trounce Snowflake? What makes it inherently so much faster at 100TB jobs?

Also what kind of queries are we talking about?

> Why would Spark trounce Snowflake? What makes it inherently so much faster at 100TB jobs?

These are the slides from a talk one of the co-founders (@rxin) gave at Stanford. https://web.stanford.edu/class/cs245/slides/LakehouseGuestTa...

It goes into the details of how this performance is achieved (and not just at 100TB). Part of it can be attributed to innovations in the storage layer (Delta Lake), and part of it is just the new query engine design itself.