by glogla 1679 days ago
In what way is lakehouse architecture beneficial over something like Snowflake or BigQuery?

I understand the appeal over having lake and warehouse as separate components, but with those native cloud warehouses, you can already do everything a lake does.


With a data warehouse, you can only interface with your data in SQL. With BigQuery and Snowflake, your data is locked away in a proprietary format not accessible by other compute platforms. You need to export/copy your data to a different system to train an ML model in Python or R.

With the lakehouse, you can use Python, R, and Scala (not just SQL) to interface with your data. You can use multiple compute engines (Spark, Databricks, Presto), so you are not locked into one compute engine.

I recall being a junior programmer and wishing I could talk to my MySQL database in Python code to do some processing that was difficult to express in SQL. That day is finally here.

BigQuery does support ML. But the pricing is kind of a racket ($250/TB), so I’ll stick to modeling in R/Python. Which I guess reinforces your point. I wonder who pays for this.

https://cloud.google.com/bigquery-ml/docs/introduction
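For a rough sense of what that rate means, here is a minimal back-of-the-envelope sketch. It assumes a flat $250 per TB processed, which is only the figure quoted in the comment above and may not match actual or current pricing; the function name and the sample sizes are made up for illustration.

```python
# Assumption: flat rate of $250 per TB processed, as quoted in the comment.
# Real pricing may differ by model type, region, and billing unit (TB vs TiB).
RATE_USD_PER_TB = 250.0


def training_cost_usd(bytes_processed: int,
                      rate_usd_per_tb: float = RATE_USD_PER_TB) -> float:
    """Estimate the cost of a job that processes `bytes_processed` bytes."""
    terabytes = bytes_processed / 1e12  # decimal terabytes
    return terabytes * rate_usd_per_tb


if __name__ == "__main__":
    # Hypothetical data sizes, in GB, purely for illustration.
    for size_gb in (10, 500, 4000):
        cost = training_cost_usd(size_gb * 10**9)
        print(f"{size_gb:>5} GB -> ${cost:,.2f}")
```

The point of the sketch is only that cost scales linearly with bytes processed, so repeated training runs over the same large table add up quickly.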

My experience is that's how it looks at first. But it is hard to actually make use of lake or lakehouse openness.

You can access data in Snowflake or BigQuery using JDBC or Python clients. You do pay for the compute that reads the data for you. You cannot access the data in storage directly.

You can access data in lakehouse directly, by going to cloud storage. That has two major challenges:

Lakehouse formats aren't easy to deal with. You need a smart engine (like Spark) to do that. But those engines are pretty heavy. Starting a Spark cluster to update 100 records in a table is wasteful.

The bigger challenge is security. Cloud storage can't give you granular access control. It only sees files, not tables and columns. So if you have a need for column or row-based security or data masking, you're out of luck. Cloud storage also makes it hard to assign even the non-granular access. Not sure about other clouds, but AWS IAM roles are hard to manage and don't scale for a large number of users/groups.
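To make the "it only sees files" point concrete, here is a minimal sketch of a storage-level check. It is a simplified model, not any cloud provider's API: `PrefixGrant`, `can_read`, and the object keys are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PrefixGrant:
    """Hypothetical storage grant: a principal may read keys under a prefix."""
    principal: str
    prefix: str


def can_read(grants: Iterable[PrefixGrant], principal: str, object_key: str) -> bool:
    """Storage-level decision: based only on who is asking and the object key."""
    return any(
        g.principal == principal and object_key.startswith(g.prefix)
        for g in grants
    )


grants = [PrefixGrant("analyst", "warehouse/sales/orders/")]

# The decision is all-or-nothing per file. If this file contains an `email`
# column, the analyst either gets every column and row in it or none of them.
orders_file = "warehouse/sales/orders/part-0000.parquet"
salaries_file = "warehouse/hr/salaries/part-0000.parquet"

print(can_read(grants, "analyst", orders_file))    # allowed by prefix
print(can_read(grants, "analyst", salaries_file))  # no matching prefix
```

Nothing in the inputs to `can_read` describes columns or rows, which is why masking or row filtering has to happen in some layer above the storage.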

You can sidestep this by using a long-running engine (like Trino) and applying security there. Then you don't need to start Spark to change or query a few records. But it means you're basically implementing your own cloud warehouse.

Which honestly can be the way if that's what you want! You can also use multiple engines if you are ok with implementing security multiple times. To me, that doesn't seem to be worth it.

In the end, I don't see data that's one SELECT away as much more proprietary and "outsourced" than data that is one Spark/Trino cluster and then SELECT away, just because you can read the S3 it sits on.

Have you ever tried to train models on large data sets over JDBC/ODBC? It’s terrible even with parallelism. Having direct access to the underlying storage and being able to bypass sucking a lot of data over a small straw is a game changer. That is one advantage that Spark and Databricks have over Snowflake.
Have you tried to implement row- and column-based security on direct access to cloud storage? It flat out does not work.

Sadly, those things are mutually exclusive at the moment and with the way things are deployed here (large multi-tenant platforms), the security has to take priority.

But if that's not your situation, then obviously it makes sense to make use of that!

> Have you tried to implement row- and column-based security on direct access to cloud storage? It flat out does not work.

It is a solved problem. Essentially you need a central place (with decentralized ownership for the datamesh fans) to specify the ACLs (row-based, column-based, attribute-based, etc.) and an enforcement layer that understands these ACLs. There are many solutions, including the ones from Databricks. Data discovery, lineage, data quality, etc. go hand in glove.
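The "central ACLs plus an enforcement layer" idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not how Databricks or any other product implements it: `TablePolicy`, `POLICIES`, `enforce`, and the sample principal and table are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Set, Tuple

Row = Dict[str, object]


def allow_all(row: Row) -> bool:
    """Default row filter: every row is visible."""
    return True


@dataclass
class TablePolicy:
    allowed_columns: Set[str]                        # column-based access
    masked_columns: Set[str] = field(default_factory=set)  # data masking
    row_filter: Callable[[Row], bool] = allow_all    # row-based access


# The "central place": one catalog keyed by (principal, table).
POLICIES: Dict[Tuple[str, str], TablePolicy] = {
    ("analyst_eu", "orders"): TablePolicy(
        allowed_columns={"order_id", "region", "email"},
        masked_columns={"email"},
        row_filter=lambda row: row["region"] == "EU",
    ),
}


def enforce(principal: str, table: str, rows: Iterable[Row]) -> List[Row]:
    """The enforcement layer: every read passes through here."""
    policy = POLICIES.get((principal, table))
    if policy is None:
        raise PermissionError(f"{principal} has no access to {table}")
    visible = []
    for row in rows:
        if not policy.row_filter(row):
            continue  # row-level filter
        visible.append({
            col: ("***" if col in policy.masked_columns else value)
            for col, value in row.items()
            if col in policy.allowed_columns  # column-level filter
        })
    return visible


orders = [
    {"order_id": 1, "region": "EU", "email": "a@example.com", "amount": 10},
    {"order_id": 2, "region": "US", "email": "b@example.com", "amount": 20},
]
print(enforce("analyst_eu", "orders", orders))
```

The sketch only works if every engine reads through `enforce`; anything that can reach the underlying files directly bypasses it, which is the disagreement in this thread.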

Security is front and centre for almost all organizations now.

This is exactly what FAANGs do with their data platforms. There are literally hundreds of groups within these companies with very strict data isolation requirements between them. Pretty sure something like that is either already possible or will be very soon; there's just too much prior art here.
That's where Databricks comes in though, you can implement row/column based security on your data on cloud object storage and use it for all your downstream use cases (not just BI/SQL but AI/ML without piping data over JDBC/ODBC).
According to their documentation [1], Databricks does not have this capability even for their own engines, and definitely not for "without piping data".

This is what I've personally seen a few times - Databricks claiming they can do something and then it turns out they can't. Buyer beware lying salespeople and HN shills.

[1]: https://docs.databricks.com/administration-guide/access-cont...

I have not, but I do not see why it would be much slower than direct access to the storage. Databases are quite good at streaming rows.
> I do not see why it would be much slower than direct access to the storage.

Implementations of protocols like ODBC/JDBC generally implement their custom on-wire binary protocols that must be marshalled to/from the lib - and the performance would vary a lot from one implementation to another. We are seeing a lot of improvements in this space though, especially with the adoption of Arrow.
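As a rough illustration of the marshalling point, the sketch below contrasts a row-oriented encoding (each record packed and unpacked field by field) with a columnar one (each column moved as one contiguous buffer). It uses only the Python standard library and is not the Arrow format or any real driver's wire protocol; the two-column schema and function names are assumptions for the example.

```python
import struct
from array import array

# Hypothetical result set: (id, score) pairs.
rows = [(i, i * 0.5) for i in range(1000)]

ROW_FORMAT = "<qd"                      # int64 id, float64 score
ROW_SIZE = struct.calcsize(ROW_FORMAT)  # 16 bytes per record


def encode_rows(records):
    """Row-oriented: one pack call per record."""
    return b"".join(struct.pack(ROW_FORMAT, rid, score) for rid, score in records)


def decode_rows(buf):
    """Row-oriented: one unpack call per record on the client side."""
    return [struct.unpack_from(ROW_FORMAT, buf, off)
            for off in range(0, len(buf), ROW_SIZE)]


def encode_columns(records):
    """Columnar: each column becomes a single contiguous typed buffer."""
    ids = array("q", (r[0] for r in records))
    scores = array("d", (r[1] for r in records))
    return ids.tobytes(), scores.tobytes()


def decode_columns(id_buf, score_buf):
    """Columnar: a bulk copy per column, no per-value parsing loop."""
    ids = array("q")
    ids.frombytes(id_buf)
    scores = array("d")
    scores.frombytes(score_buf)
    return ids, scores


row_buf = encode_rows(rows)
id_buf, score_buf = encode_columns(rows)
ids, scores = decode_columns(id_buf, score_buf)
print(len(row_buf), len(id_buf) + len(score_buf))
```

Both encodings carry the same number of bytes here; the difference is in how much per-value work the client does to turn them back into usable data, which is where columnar transfer tends to help.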

There is also the question of computing for ML. Data scientists today use several tools/frameworks ranging from scikit-learn/XGBoost to PyTorch/Keras/TensorFlow - to name a few. Enabling data scientists to use these frameworks against near-realtime data without worrying about provisioning infrastructure or managing dependencies or adding an additional export-to-cloud-storage hop is a game changer IMO.

> There is also the question of computing for ML.

A few reasons why the Databricks platform shines here.

1) Not limited to just UDFs - Extensions to improve performance, including GPU acceleration in XGBoost, distributed deep learning using HorovodRunner.

2) End-to-end MLOps solution - including Feature store, Model registry & Model Serving

3) Open approach with https://www.mlflow.org/

4) Glass box (not black box) model for AutoML

Here is the thing with the lakehouse though, you have flexibility and don’t need to use multiple engines to achieve the lakehouse vision. Databricks has all the security features that Redshift / Snowflake does, so you can secure databases and tables rather than S3 buckets. It does get more complex if you want to introduce multiple engines, but at least you have the option to make that trade-off if you want to.

If you want simplicity, you can limit your engine to Databricks. You can also use JDBC/ODBC with Databricks if you want to use other tools that don’t support the Delta format/Parquet, but piping data over JDBC/ODBC doesn’t scale to large datasets with any tool. Databricks has all the capabilities of BigQuery/Snowflake/Redshift, but none of those tools support Python/R/Scala. Their engines need to be rewritten from the ground up in order to do so.

But you do still have to secure the S3 buckets, right? And I guess also secure the infrastructure you have to deploy in order to run Databricks. Plus then configure for cross-AZ failover etc. So you get flexibility, but I would think at the cost of much more human labor to get it up and running.

Snowflake uses the Arrow data format with their drivers, so is plenty fast enough when retrieving data in general. But it would be way less efficient if a data scientist just does a SELECT * to bring everything back from a table to load into a notebook.
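The SELECT * point can be illustrated with a small sketch. It uses an in-memory SQLite database purely as a stand-in for a warehouse, and the table, columns, and the character-count proxy for transfer size are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER, label INTEGER, feature REAL, payload TEXT)"
)
# Hypothetical wide table: a bulky `payload` column the model never uses.
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [(i, i % 2, i / 10, "x" * 1000) for i in range(500)],
)


def fetched_chars(query: str) -> int:
    """Crude proxy for transfer size: total characters across all fetched values."""
    return sum(len(str(value)) for row in conn.execute(query) for value in row)


everything = fetched_chars("SELECT * FROM events")
needed = fetched_chars("SELECT label, feature FROM events")
print(f"SELECT *: {everything} chars, projected: {needed} chars")
```

However fast the driver is, pulling back only the columns a model needs shrinks what has to cross the wire in the first place.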

Snowflake has had Scala support since earlier in the year, along with Java UDFs, and also just announced Python support - not a Python connector, but executing Python code directly on the Snowflake platform. Not GA yet though.

You can use Scala, Java and Python with Snowflake now, as well as process structured, semi-structured and unstructured data. So I guess that means it doesn't fit into the data warehouse category, but is not a lakehouse either.
BigQuery & Dataproc, Redshift & EMR, Synapse & HDR are tied to the cloud vendors. You can’t move easily from the AWS stack to GCP without refactoring. Switching costs are higher.

Snowflake and Databricks are multicloud. The difference is that Snowflake is more like a SaaS solution and only does SQL. Databricks is more than just SQL. It has all the data science and machine learning information built into it. Snowflake has Snowpark, but it’s very limited, so you are more likely to have to buy more products to build out your capabilities and integrate them with Snowflake. With Databricks it is more out of the box in terms of capabilities. Databricks also runs in your cloud account, which has trade-offs. It can be harder to get going and more complex, but you end up with a lot more flexibility and you own your data and have complete control over it. While Snowflake gives you control of your data with their tools, everything has to go through Snowflake and incur their tax to get to it. You pay for simplicity, which many customers are ok with because they see value in it. On the other hand, a lot of customers see value in having more control and options. This market is big enough for everyone - it’s really just about market share.