Hacker News
by feqgmmr2 1679 days ago
The 'open' here refers to the data. Delta Lake can be read and written by multiple open source engines, not just Spark. You can also use Databricks with plain Parquet if you want, though the experience won't be as good.

But with Snowflake, the data never comes out. Can't use Spark/Trino/Flink... on data in SF.

3 comments

Do you have to pay to export data out of Snowflake? Yes. They have a nice guide on how to spend money doing it (https://docs.snowflake.com/en/user-guide/data-unload-overvie...).

Do you have to pay to export data out of Databricks? No, it's already sitting where you want it.

Which one is open, I wonder?

I used Snowflake in my previous company. When we loaded data into Snowflake, we loaded it FROM S3/Blob where we also kept it.

So you were paying to store the same data twice. Once in S3 and once in Snowflake. Why not just purge it from S3 and only keep it in Snowflake?

Not entirely true. There is a bi-directional Spark connector for Snowflake written by Databricks. And exporting your data in bulk out of Snowflake into any number of open formats is incredibly easy using the COPY INTO command. You can also use Snowflake on top of Parquet and even Delta Lake.
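
For anyone who hasn't seen it, the bulk unload mentioned above is roughly one statement. Here is a minimal sketch, assuming an S3 target and a storage integration that has already been set up. The table, bucket, and integration names are made up for illustration, and the Python helper is hypothetical: it only assembles the SQL text and does not connect to anything.

```python
import re

# Hypothetical helper: builds the SQL text only, it never talks to Snowflake.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_$.]*$")


def build_unload_sql(table: str, target_url: str, integration: str) -> str:
    """Return a COPY INTO <location> statement that unloads a table as Parquet."""
    # Reject anything that does not look like a plain (optionally dotted) identifier.
    for name in (table, integration):
        if not _IDENT.match(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    # The URL is embedded in a single-quoted literal, so refuse embedded quotes.
    if "'" in target_url:
        raise ValueError("target URL must not contain a single quote")
    return (
        f"COPY INTO '{target_url}'\n"
        f"  FROM {table}\n"
        f"  STORAGE_INTEGRATION = {integration}\n"
        f"  FILE_FORMAT = (TYPE = PARQUET)"
    )


# All three names below are placeholders, not real resources.
print(
    build_unload_sql(
        "analytics.orders",
        "s3://example-bucket/unload/orders/",
        "example_s3_integration",
    )
)
```

The statement itself still runs on a virtual warehouse, so it consumes credits like any other query, which is the cost the rest of this thread is arguing about.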

This is the problem. Both Snowflake and Databricks are spreading FUD and otherwise smart people are falling for it.

It is not a "small" cost. The cost is proportional to the size of the data exported.

For all intents and purposes, large amounts of data are locked into Snowflake. Is it theoretically possible to export a petabyte out of SF? Sure.

Do I want to spend money on it? Not really. That is what I mean by "the data never comes out".

"Exporting" a petabyte out of Databricks is a no-op. I can already read Delta Lake from other open source tools.

"Exporting a PB from Snowflake" is only ever relevant if you want to move from Snowflake to something else. In that case, all the other migration costs (recoding, redocumenting, and especially revalidating everything if you are in a regulated environment) are going to make the cost of data movement irrelevant.

This is just FUD.

I think it's important to understand how this kind of scenario comes up. It's unusual to want to move a whole PB at one time, and yeah in that case these other costs would come up. Problem is, the cost is more insidious than that.

Consider a scenario where data is coming in periodically, say daily, from some source: server logs, sensor data, whatever. The user wants to train models daily on the data and also wants to do some SQL. Maybe they ingest the data directly into SF and copy it out for training, or they do it the other way round, landing it in object store and then ingesting it into SF. This is unlikely to be a humongous amount of data per day; it's probably not a PB. However, it adds up. Maybe for some use cases it becomes a PB in a month, maybe in a quarter, maybe it only adds up to a PB in a year.

Thing is, without a Lakehouse architecture, the user will pay to store and copy that data multiple times (at least twice) no. matter. what. They may not pay for a PB in one shot, but you can bet that eventually they'll pay multiple times to store and copy that PB.
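
To put rough numbers on the "it adds up" argument, here is a back-of-the-envelope sketch. Every figure in it (ingest rate, storage price, per-TB copy charge) is a placeholder I made up, not a quote from any vendor. The billing model is also deliberately simplified: storage is charged per TB per 30-day month on whatever has accumulated so far, and every copy beyond the first pays a one-time per-TB charge to be created.

```python
def cumulative_cost(daily_tb, days, store_price_per_tb_month, copy_price_per_tb, copies):
    """Total cost of keeping `copies` copies of a daily feed for `days` days.

    Simplified model (an assumption, not any vendor's real pricing):
    - storage is billed daily at 1/30 of the monthly per-TB price,
      on all data accumulated so far, once per copy;
    - each copy beyond the first pays a one-time per-TB copy charge.
    """
    storage = 0.0
    stored_tb = 0.0
    for _ in range(days):
        stored_tb += daily_tb  # today's batch lands
        storage += stored_tb * copies * store_price_per_tb_month / 30
    copying = daily_tb * days * (copies - 1) * copy_price_per_tb
    return storage + copying


# Placeholder inputs: 3 TB/day for a year, $20 per TB-month, $5 per TB copied.
one_copy = cumulative_cost(3, 365, 20, 5, copies=1)
two_copies = cumulative_cost(3, 365, 20, 5, copies=2)
print(f"one copy:   ${one_copy:,.0f}")
print(f"two copies: ${two_copies:,.0f}")
print(f"extra paid for the second copy: ${two_copies - one_copy:,.0f}")
```

No single day looks expensive in this model. The gap between the one-copy and two-copy totals keeps widening, though, because the second copy is billed on the full accumulated volume every day.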

It's very relevant if you ever want to do serious ML or anything other than SQL. Of course Snowflake wants you to think that you never need another platform. Every customer knows that's not the case.

So if I stop paying Databricks, I can no longer use their proprietary query engine (Photon), right? I have to use something else, like open source Spark SQL, which is slower and will cost a lot more money.

There are different ways to lock customers in and both Databricks and Snowflake are playing the game.

I’m not sure this locks anyone in. The APIs are open, and Spark code will run on, say, EMR just fine.

Every vendor, be it Snowflake, Databricks, EMR, Athena, BQ, … charges for use of the engine. The difference with a Lakehouse is that one doesn’t have to pay the vendor for the simple ability to use the data with another offering. That’s what you have to pay for with closed systems, whether it’s data on the way in or data on the way out.

Agreed there is a small cost, but it is possible, which is at odds with your statement "with Snowflake, the data never comes out".