Hacker News new | ask | show | jobs
by scapecast 1679 days ago
The irony here is that what Databricks is doing to Snowflake is exactly what Snowflake did to AWS and Redshift.

Same playbook - show that you’re better in a key metric that’s easy to understand (performance) to get the attention, but then pitch the paradigm change.

In Snowflake’s case, that was separation of storage and compute.

In Databrick’s case, it’s the Lakehouse Architecture.

I think the reason why Snowflake is so nervous because they know they can’t win this game.

4 comments

To be fair Apache Spark, which started long before either company existed, was built on the assumption that compute and storage should be separate. Unlike Hadoop, Spark did not come with any storage system and could read from any source.
> To be fair Apache Spark, which started long before either company existed

Databricks was founded before Spark 1.0 released by Spark's creators.

Hadoop was created at a time when network and disk were much slower, RAM was less abundant. Bringing compute to the data made sense, but it typically doesn't anymore.

Hadoop was built on the notion that commodity hardware, when pooled together, can be extremely cheap and powerful. The problem is, to manage it, is a nightmare. Cloudera/HWX and others were unable to reduce the management burden and their inability to pivot to a cloud based architecture really sunk their ship.
SF spreads a lot of FUD saying that DB can’t perform, and it was true. DB then went out and hired a lot of engineering talent with a diverse background and has been investing a lot of money in being a best in class SQL offering, so what do you do? You do something to get people’s attention. They’re saying, “hey, we have great performance too, you should also look at us for your SQL workloads.”
> I think the reason why Snowflake is so nervous because they know they can’t win this game.

Isn't Databricks' delta.io, which their Data Lakehouse product builds on top of, open source? Snowflake could take the best parts from and run with it?

They could in principle. GCP, for instance, does do that. So does HP. And Databricks don't mind that as they have a strong open source legacy. But that takes away the proprietary lock-in strategy of Snowflake.
Delta is open source, but Databricks keeps optimizations for themselves as proprietary. I'm not sure why it would be any better than Snowflake's solution, which is automatically deployed across multiple AZs as a fully HA system and gives full ACID transaction compliance across any number of tables (not just per-table).
Essentially with Databricks making Delta open source, you can move away from Databricks to EMR or Presto (with their own optimizations) without incurring a data tax. You're also able to move between cloud providers at ease as data sits in low cost buckets.
In what way is lakehouse architecture beneficial over something like Snowflake or BigQuery?

I understand the appeal over having lake and warehouse as separate components, but with those native cloud warehouses, you can already do everything a lake does.

With a datawarehouse, you can only interface with your data in SQL. With big query and snowflake, your data is locked away in a proprietary format not accessible by other compute platforms. You need to export/copy your data to a different system to train an ML model in python or R.

With the lakehouse, you can use python, R and Scala, (not just SQL) to interface with your data. You can use multiple compute engines (spark, Databricks, presto) so you are not locked into one compute engine.

I recall being a junior programmer, and wishing I could talk to my MySQL database in python code to do some processing that was difficult to express in SQL, that day is finally here.

BigQuery does support ML. But the pricing is kind of a racket ($250/TB) so I’ll stick to modeling in R/python. Which I guess reinforces your point. I wonder who pays for this.

https://cloud.google.com/bigquery-ml/docs/introduction

My experience is that's how it looks at first. But it is hard to actually make use of lake or lakehouse openness.

You can access data in Snowflake or BigQuery using JDBC or Python clients. You do pay for the compute that reads the data for you. You cannot access the data in storage directly.

You can access data in lakehouse directly, by going to cloud storage. That has two major challenges:

Lakehouse formats aren't easy to deal with. You need a smart engine (like Spark) to do that. But those engines are pretty heavy. Staring a Spark cluster to update 100 records in a table is wasteful.

The bigger challenge is security. Cloud storage can't give you granular access control. It only sees files, not tables and columns. So if you have a need for column or row-based security or data masking, you're out of luck. Cloud storage also makes it hard to assign even the non-granular access. Not sure about other clouds, but AWS IAM roles are hard to manage and don't scale for large number of users/groups.

You can sidestep this by using a long-running engine (like Trino) and applying security there. Then you don't need to start Spark to change or query a few records. But it means you're basically implementing your own cloud warehouse.

Which honestly can be the way if that's what you want! You can also use multiple engines if you are ok with implementing security multiple times. To me, that doesn't seem to be worth it.

In the end, I don't see data that's one SELECT away as much more proprietary and "outsourced" than data that is one Spark/Trino cluster and then SELECT away, just because you can read the S3 is sits on.

Have you ever tried to train models on large data sets over JDBC/ODBC? it’s terrible even with parallelism. Having direct access to the underlying storage and being able to bypass sucking a lot of data over a small straw is a game changer. That is one advantage that Spark and Databricks have over Snowflake.
Have you tried to implement row- and column-based security on direct access to cloud storage? It flat out does not work.

Sadly, those things are mutually exclusive at the moment and with the way things are deployed here (large multi-tenant platforms), the security has to take priority.

But if that's not your situation, then obviously it makes sense to make use of that!

> Have you tried to implement row- and column-based security on direct access to cloud storage? It flat out does not work.

It is a solved problem. Essentially you need a central place ( with decentralized ownership for the datamesh fans ) to specify the ACLS ( row-based, column-based, attribute-based etc.) - and an enforcement layer that understands these ACLs. There are many solutions, including the ones from Databricks. Data discovery, lineage, data quality etc., go hand in glove.

Security is front and centre for almost all organizations now.

This is exactly what FAANGs do with their data platforms. There are literally hundreds of groups within these companies with very strict data isolation requirements between them. Pretty sure something like that is either already possible or will be very soon, there's just too much prior art here.
Thats where Databricks comes in though, you can implement row/column based security on your data on cloud object storage and use it for all your downstream use cases (Not just BI/SQL but AI/ML without piping data over JDBC/ODBC).
I have not, but I do not see why it would be much slower than direct access to the storage. Databases are quite good at streaming rows.
> I do not see why it would be much slower than direct access to the storage.

Implementations of protocols like ODBC/JDBC generally implement their custom on-wire binary protocols that must be marshalled to/from the lib - and the performance would vary a lot from one implementation to another. We are seeing a lot of improvements in this space though, especially with the adoption of Arrow.

There is also the question of computing for ML. Data scientists today use several tools/frameworks ranging from scikit-learn/XGBoost to PyTorch/Keras/TensorFlow - to name a few. Enabling data scientists to use these frameworks against near-realtime data without worrying about provisioning infrastructure or managing dependencies or adding an additional export-to-cloud-storage hop is a game changer IMO.

Here is the thing with the lakehouse though, you have flexibility and don’t need to use multiple engines to achieve the lakehouse vision. Databricks has all the security features a redshift / snowflake does so you can secure databases and tables rather than s3 buckets. It does get more complex if you want to introduce multiple engines but at least you have the option to make that trade off if you want to.

If you want simplicity, you can limit your engine to Databricks. You can also use JDBC/ODBC with Databricks if you want to use other tools that don’t support the delta format/parquet but piping data over JDBC/ODBC doesn’t scale with any tool to large datasets. Databricks has all the capabilities of big query/snowflake/redshift but none of those tools support python/r/scala. Their engines need to be rewritten from the ground up in order to do so.

But you do still have to secure the S3 buckets, right? And I guess also secure the infrastructure you have to deploy in order to run Databricks. Plus then configure for cross-AZ failover etc. So you get flexibility, but I would think at the cost of much more human labor to get it up and running.

Snowflake uses the Arrow data format with their drivers, so is plenty fast enough when retrieving data in general. But it would be way less efficient if a data scientist just does a SELECT * to bring everything back from a table to load into a notebook.

Snowflake has had Scala support since earlier in the year, along with Java UDFs, and also just announced Python support - not a Python connector, but executing Python code directly on the Snowflake platform. Not GA yet though.

You can use Scala, Java and Python with Snowflake now, as well as process structured, semi-structured and unstructured data. So I guess that means it doesn't fit into the data warehouse category, but is not a lakehouse either.
Big Query&Data Proc, Redshift&EMR, Synapse&HDR are tied to the cloud vendors. You can’t move easily from AWS stack to GCP without refactoring. Switching costs are higher.

Snowflake and Databricks are multicloud. The different is that Snowflake is more like a SaaS solution and only does SQL. Databricks is more than just SQL. It has all the data science, machine learning information, built into it. Snowflake has Snowpark but it’s every limited and so you are more likely to have to buy more products to build out your capabilities and integrate them with Snowflake. With Databricks it is more out of the box in terms of capabilities. Databricks also runs in your cloud account which has trade offs. It can be harder to get going and more complex but you end up with a lot more flexibility and you own your data and have complete control over it. While Snowflake gives you control of your data with their tools, everything has to go through Snowflake and incur their tax to get to it. You pay for simplicity, which many customers are ok with because they see value in it. On the contrary, a lot of customers see value in having more control and options. This market is big enough for everyone - it’s really just about market share.