by bpaneural 1670 days ago
They must do. But if you've been in this area for long enough, I'd put my money on Databricks, if anything, because of their open source integrity.
2 comments

Photon, which was used in their benchmark, is not open source. Don't be fooled by DB.
Apache Spark is an open API. You can build your ETL with it and run it on an open source Spark cluster, an AWS EMR cluster, or a Databricks cluster. It will work across all three (and others) because the API is open.

Vendors can implement that API with their own optimizations. EMR makes optimizations in their implementation and so does Databricks. Photon is a new engine, but it implements the Apache Spark API for better performance. There's nothing to stop EMR or any other Apache Spark vendor from undertaking the same strategy.

This openness has allowed customers of Hortonworks and Cloudera to migrate their workloads to the cloud more easily than if they had to refactor from something completely different, like Oracle PL/SQL routines.

Snowflake does not have an open ETL API. If you write stored procedures in Snowflake, you can only run them on Snowflake. This is one of the reasons people choose to use dbt with Snowflake. It gives them an open ETL layer to provide future optionality.

There's no reason why you couldn't use Snowflake as the datastore and Spark as the ETL. However, it would be prohibitively expensive to do so. You would need to pay for the Spark cluster, but also a Snowflake cluster to export and import the data. Exporting a handful of terabytes from Snowflake can also take hours depending on your cluster configuration.

By storing your data on S3 in an open format, like Apache Parquet or Delta Lake, you can just use a different engine on it without needing to export / import it. In addition to Spark, Presto & Trino are popular engines to use when querying a data lake.

This optionality is ultimately good for customers. If Apache Spark is best for your use case, then you can host Spark yourself or choose EMR, Databricks, Cloudera, etc. If Presto is best for your use case, you can choose AWS Athena, Starburst, Ahana, etc. Once you pick the best tech for your use case, you have several vendors to compare for the best deal.

If I want to move off Snowflake to Firebolt or some other data warehouse, I need to pay both vendors to get my data out and get my data in. Snowflake wasn't around 10 years ago, and if they are no longer a good option 10 years from now, I don't want to have to pay them for the privilege of exporting my data. I could rectify that by keeping all my data in a data lake, but now I'm paying to store the data twice.

Open APIs enable an open ecosystem, which encourages competition.

Databricks isn't open source, as they keep hold of all the IP that makes it much better than OS Spark. Whether you buy Snowflake or Databricks, you're buying proprietary software.
With Snowflake, data is locked away in a proprietary format that other compute platforms can't access. You need to export/copy your data to a different system to train an ML model in Python or R. With Databricks, you can use Python, R, and Scala (not just SQL) to interface with your data. You can use multiple compute engines (Spark, Presto, and other engines that support Delta), so you are not locked into one compute engine.
This is very true. They make the lowest common denominator parts "open source" but control all of the commits. Also, the query engine used for this benchmark (Photon) is proprietary and closed source.
The 'open' here refers to the data. Delta Lake can be read and written by multiple open source engines, not just Spark. Not to mention that, if you want, you can use Databricks with Parquet, though the experience won't be as good.

But with Snowflake, the data never comes out. Can't use Spark/Trino/Flink... on data in SF.

Do you have to pay to export data out of Snowflake? Yes. They have a nice guide on how to spend money doing it (https://docs.snowflake.com/en/user-guide/data-unload-overvie...).

Do you have to pay to export data out of Databricks? No, it's already sitting where you want it.

Which one is open? I wonder

I used Snowflake in my previous company. When we loaded data into Snowflake, we loaded it FROM S3/Blob where we also kept it.
So you were paying to store the same data twice. Once in S3 and once in Snowflake. Why not just purge it from S3 and only keep it in Snowflake?
Not entirely true. There is a bi-directional Spark connector for Snowflake written by Databricks. And exporting your data in bulk out of Snowflake into any number of open formats is incredibly easy using the COPY INTO command. You can also use Snowflake on top of Parquet and even Delta Lake.

This is the problem. Both Snowflake and Databricks are spreading FUD and otherwise smart people are falling for it.
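
For reference, the bulk export mentioned above looks roughly like the statement built below. This is a minimal sketch that only assembles the SQL text in Python; the stage name `@my_unload_stage` and the table name are hypothetical, the helper `build_unload_statement` is made up for illustration, and actually running the statement would still need a Snowflake session and a stage set up beforehand.

```python
def build_unload_statement(table: str, stage_path: str, file_type: str = "PARQUET") -> str:
    """Build the text of a Snowflake COPY INTO <location> unload statement.

    Only the SQL string is produced; nothing is sent to Snowflake.
    Identifiers are inserted as-is, so pass trusted values only.
    """
    allowed = {"PARQUET", "CSV"}
    file_type = file_type.upper()
    if file_type not in allowed:
        raise ValueError(f"unsupported file type: {file_type}")
    return (
        f"COPY INTO {stage_path} "
        f"FROM {table} "
        f"FILE_FORMAT = (TYPE = {file_type})"
    )


# Hypothetical names, for illustration only.
statement = build_unload_statement("analytics.public.events", "@my_unload_stage/events/")
print(statement)
```

The statement itself is short; the argument in this thread is about what it costs to run at scale, not about whether it exists.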

It is not a "small" cost. The cost is proportional to the size of the data exported.

For all intents and purposes, large amounts of data are locked into Snowflake. Is it theoretically possible to export a petabyte out of SF? Sure.

Do I want to spend money on it? Not really. That is what I mean by the "data doesn't come out".

"Exporting" a petabyte out of Databricks is a no-op. I can already read Delta Lake from other open source tools.
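
To make "proportional" concrete, here is a back-of-the-envelope sketch. It assumes a purely linear cost model, and the per-terabyte rate is a made-up placeholder, not a real Snowflake price; plug in whatever your own compute and egress actually cost.

```python
def export_cost(size_tb: float, cost_per_tb: float) -> float:
    """Linear cost model: total cost grows in direct proportion to data size."""
    if size_tb < 0 or cost_per_tb < 0:
        raise ValueError("size and rate must be non-negative")
    return size_tb * cost_per_tb


PLACEHOLDER_RATE = 10.0  # hypothetical cost units per TB, NOT a real price

for size_tb in (1, 100, 1000):  # 1000 TB is roughly a petabyte
    print(f"{size_tb:>5} TB -> {export_cost(size_tb, PLACEHOLDER_RATE):,.0f} units")
```

Whatever the real rate is, a thousand times the data means a thousand times the bill under this model, whereas data already sitting in an open format needs no export step at all.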

"Exporting PB from Snowflake" is only ever relevant if you want to move from Snowflake to something else. In that case, all the other migration costs (recoding, redocumenting, and especially revalidating everything, if you are in a regulated environment) are going to make the cost of data movement irrelevant.

This is just FUD.

So if I stop paying Databricks, I can no longer use their proprietary query engine (Photon), right? I have to use something else, like Open Source Spark SQL which is slower and will cost a lot more money.

There are different ways to lock customers in and both Databricks and Snowflake are playing the game.

Agreed there is a small cost, but it is possible, which is at odds with your statement "with Snowflake, the data never comes out".