Hacker News
by kartoonhero 1679 days ago
Please read up on Lakehouse.

Data Lake + Merge support + DW performance is now possible.

That is the game changer.

2 comments

It'll take a few more years until these companies fix all the bugs and address all the scalability issues.

As of today, these companies are not good enough to take on the Data Warehouse part.

Spark has always been able to handle way larger scale than any DW.
Handle what though?

Can Spark query 100Bn rows of structured data while aggregating on multiple fields (or dimensions)?

In my previous company, we had 63 petabytes of data in Snowflake.
That sounds great: storage problem is solved.

What about large-scale reads via OLAP queries (y'know, the typical measures and dimensions)?
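
For anyone unsure what "measures and dimensions" means here, a minimal sketch of that query shape follows. It uses Python's built-in sqlite3 purely as a stand-in, not Spark or Snowflake, and the `sales` table, its columns, and the rows are all made up for illustration. It says nothing about how either engine behaves at 100Bn rows.

```python
import sqlite3

# Hypothetical toy fact table: two dimensions (region, product), one measure (amount).
rows = [
    ("EU", "widget", 10.0),
    ("EU", "widget", 5.0),
    ("EU", "gadget", 7.0),
    ("US", "widget", 3.0),
    ("US", "gadget", 4.0),
    ("US", "gadget", 6.0),
]


def aggregate_sales(rows):
    """Sum and count the measure for every (region, product) pair."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
        cur = conn.execute(
            "SELECT region, product, SUM(amount), COUNT(*) "
            "FROM sales GROUP BY region, product ORDER BY region, product"
        )
        return cur.fetchall()
    finally:
        conn.close()


result = aggregate_sales(rows)
for row in result:
    print(row)
```

The GROUP BY columns are the dimensions and the SUM/COUNT expressions are the measures. The open question in this thread is how well each engine runs this pattern at scale, which a toy like this cannot answer.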

That's a respectable amount for a DW, true. Spark and its ilk are designed for much larger scales though. Multiple FAANG use cases for Spark are in the petabytes per week range.
Do you work for Databricks?
They must do. But if you've been in this area for long enough, I'd put my money on Databricks, if anything, because of their open source integrity.
Photon, which was used in their benchmark, is not open source. Don't be fooled by DB.
Apache Spark is an open API. You can build your ETL with it and run it on an open source Spark cluster, an AWS EMR cluster, or a Databricks cluster. It will work across all three (and others) because the API is open.

Vendors can implement that API with their own optimizations. EMR makes optimizations in their implementation and so does Databricks. Photon is a new engine, but it implements the Apache Spark API for better performance. There's nothing to stop EMR or any other Apache Spark vendor from undertaking the same strategy.
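
A rough sketch of the idea, in plain Python rather than actual Spark: one job written against a shared interface, run unchanged on two implementations with different internals. `Engine`, `ReferenceEngine`, `VendorEngine`, and `etl_job` are hypothetical names invented for this illustration; none of them are part of the Apache Spark API or any vendor's product.

```python
from typing import Callable, Iterable, List, Protocol


class Engine(Protocol):
    """Hypothetical stand-in for an open API: anything with this method qualifies."""

    def map(self, fn: Callable[[int], int], data: Iterable[int]) -> List[int]: ...


class ReferenceEngine:
    """Straightforward implementation of the contract."""

    def map(self, fn, data):
        return [fn(x) for x in data]


class VendorEngine:
    """Same contract, different internals: repeated inputs are computed once per call."""

    def map(self, fn, data):
        cache = {}
        out = []
        for x in data:
            if x not in cache:
                cache[x] = fn(x)
            out.append(cache[x])
        return out


def etl_job(engine: Engine, data):
    """The 'user code': it only knows the interface, not which engine runs it."""
    return engine.map(lambda x: x * 2, data)


data = [1, 2, 2, 3]
reference_result = etl_job(ReferenceEngine(), data)
vendor_result = etl_job(VendorEngine(), data)
print(reference_result, vendor_result)
```

The job function never changes; only the object passed in does. That is the portability being claimed for ETL written against the Spark API, with the caveat that real engines can differ in ways a toy interface does not capture.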

This openness has allowed customers of Hortonworks and Cloudera to migrate their workloads to the cloud more easily than if they had to refactor from something completely different, like Oracle PL/SQL routines.

Snowflake does not have an open ETL API. If you write stored procedures in Snowflake, you can only run them on Snowflake. This is one of the reasons people choose to use dbt with Snowflake. It gives them an open ETL layer to provide future optionality.

There's no reason why you couldn't use Snowflake as the datastore and Spark as the ETL. However, it would be prohibitively expensive to do so. You would need to pay for the Spark cluster, but also a Snowflake cluster to export and import the data. Exporting a handful of terabytes from Snowflake can also take hours depending on your cluster configuration.

By storing your data on S3 in an open format, like Apache Parquet or Delta Lake, you can just use a different engine on it without needing to export / import it. In addition to Spark, Presto & Trino are popular engines to use when querying a data lake.
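
To make the "no export / import" point concrete, here is a toy sketch. Assumptions: a local CSV file stands in for Parquet on S3, and two unrelated query paths (a plain Python loop and an in-memory SQLite database) stand in for engines like Spark and Trino. The file name, column names, and rows are all invented for illustration.

```python
import csv
import os
import sqlite3
import tempfile
from collections import defaultdict


def write_events(path):
    """Write a tiny dataset once, in an open, documented format (CSV here)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["user_id", "n_bytes"])
        writer.writerows([["ann", 10], ["bob", 5], ["ann", 7]])


def totals_with_plain_python(path):
    """'Engine' A: stream the file and aggregate in a dict."""
    totals = defaultdict(int)
    with open(path, newline="") as f:
        for rec in csv.DictReader(f):
            totals[rec["user_id"]] += int(rec["n_bytes"])
    return dict(totals)


def totals_with_sql(path):
    """'Engine' B: read the same file and aggregate with SQL."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE events (user_id TEXT, n_bytes INTEGER)")
        with open(path, newline="") as f:
            reader = csv.reader(f)
            next(reader)  # skip the header row
            conn.executemany("INSERT INTO events VALUES (?, ?)", reader)
        query = "SELECT user_id, SUM(n_bytes) FROM events GROUP BY user_id"
        return dict(conn.execute(query))
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp_dir:
    events_path = os.path.join(tmp_dir, "events.csv")
    write_events(events_path)
    python_totals = totals_with_plain_python(events_path)
    sql_totals = totals_with_sql(events_path)

print(python_totals, sql_totals)
```

Both paths read the file where it sits, and neither needs the other's permission or an unload step. That is the property being attributed to open table formats on object storage.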

This optionality is ultimately good for customers. If Apache Spark is best for your use case, then you can choose to host Spark yourself or use EMR, Databricks, Cloudera, etc. If Presto is best for your use case, you can choose AWS Athena, Starburst, Ahana, etc. Once you pick the best tech for your use case, you have several vendors to compare against for the best deal.

If I want to move off Snowflake to Firebolt or some other data warehouse, I need to pay both vendors to get my data out and get my data in. Snowflake wasn't around 10 years ago, and if they are not still a good option 10 years from now, I don't want to have to pay them for the privilege to export my data out. I could rectify that by keeping all my data in a data lake, but now I'm paying to store the data twice.

Open APIs enable an open ecosystem, which encourages competition.

Databricks isn't open source, as they keep hold of all the IP that makes it much better than OS Spark. Whether you buy Snowflake or Databricks, you're buying proprietary software.
With Snowflake, data is locked away in a proprietary format not accessible by other compute platforms. You need to export/copy your data to a different system to train an ML model in Python or R. With Databricks, you can use Python, R and Scala (not just SQL) to interface with your data. You can use multiple compute engines (Spark, Presto and other engines that support Delta), so you are not locked into one compute engine.
This is very true. They make the lowest common denominator parts "open source" but control all of the commits. Also, the query engine used for this benchmark (Photon) is proprietary and closed source.
The 'open' here refers to the data. Delta Lake can be read/written by multiple open source engines, not just Spark. Not to mention, if you want, you can use Databricks with Parquet, though the experience won't be as good.

But with Snowflake, the data never comes out. Can't use Spark/Trino/Flink... on data in SF.

Do you have to pay to export data out of Snowflake? Yes. They have a nice guide on how to spend money doing it (https://docs.snowflake.com/en/user-guide/data-unload-overvie...).

Do you have to pay to export data out of Databricks? No, it's already sitting where you want it.

Which one is open? I wonder

Not entirely true. There is a bi-directional Spark connector for Snowflake written by Databricks. And exporting your data in bulk out of Snowflake into any number of open formats is incredibly easy using the COPY INTO command. You can also use Snowflake on top of Parquet and even Delta Lake.

This is the problem. Both Snowflake and Databricks are spreading FUD and otherwise smart people are falling for it.

Agreed there is a small cost, but it is possible, which is at odds with your statement "with Snowflake, the data never comes out".