by doppelganger1 1671 days ago
Have you ever tried to train models on large data sets over JDBC/ODBC? It's terrible even with parallelism. Having direct access to the underlying storage, instead of sucking a lot of data through a small straw, is a game changer. That is one advantage that Spark and Databricks have over Snowflake.

Have you tried to implement row- and column-based security on direct access to cloud storage? It flat out does not work.

Sadly, those things are mutually exclusive at the moment, and with the way things are deployed here (large multi-tenant platforms), security has to take priority.

But if that's not your situation, then obviously it makes sense to make use of that!

> Have you tried to implement row- and column-based security on direct access to cloud storage? It flat out does not work.

It is a solved problem. Essentially you need a central place (with decentralized ownership, for the data mesh fans) to specify the ACLs (row-based, column-based, attribute-based, etc.) and an enforcement layer that understands these ACLs. There are many solutions, including the ones from Databricks. Data discovery, lineage, data quality, etc. go hand in glove.
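To make the "central ACLs plus an enforcement layer" idea concrete, here is a minimal sketch in plain Python. It is not any vendor's API: `Policy`, `POLICIES`, `read_table`, and the table and group names are all hypothetical, and a real enforcement layer would sit in the query engine rather than in application code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

Row = Dict[str, object]


@dataclass
class Policy:
    """Hypothetical policy: which columns a group may see, and which rows."""
    allowed_columns: Set[str]
    row_filter: Callable[[Row], bool]


# The "central place": one policy store keyed by (table, group).
# Table and group names are made up for illustration.
POLICIES: Dict[Tuple[str, str], Policy] = {
    ("orders", "emea_analysts"): Policy(
        allowed_columns={"order_id", "region", "amount"},
        row_filter=lambda row: row["region"] == "EMEA",
    ),
}


def read_table(table: str, group: str, rows: List[Row]) -> List[Row]:
    """The "enforcement layer": every read goes through the policy store."""
    policy = POLICIES.get((table, group))
    if policy is None:
        # Deny by default when no policy is registered.
        raise PermissionError(f"group {group!r} has no access to {table!r}")
    return [
        {col: val for col, val in row.items() if col in policy.allowed_columns}
        for row in rows
        if policy.row_filter(row)
    ]
```

The point of the sketch is only that the policy lives in one place and every reader, whether SQL or ML, has to pass through the same function that applies it.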

Security is front and centre for almost all organizations now.

This is exactly what FAANGs do with their data platforms. There are literally hundreds of groups within these companies with very strict data isolation requirements between them. Pretty sure something like that is either already possible or will be very soon; there's just too much prior art here.
That's where Databricks comes in though: you can implement row/column-based security on your data in cloud object storage and use it for all your downstream use cases (not just BI/SQL but also AI/ML, without piping data over JDBC/ODBC).
According to their documentation [1], Databricks does not have this capability even for their own engines, and definitely not for "without piping data".

This is what I've personally seen a few times: Databricks claiming they can do something, and then it turns out they can't. Buyer beware of lying salespeople and HN shills.

[1]: https://docs.databricks.com/administration-guide/access-cont...

Check out https://databricks.com/product/unity-catalog when you get a chance. There are other solutions in this space as well.
I don’t understand what capability you are saying Databricks lacks. This capability is literally the entire premise of the Data Lakehouse. With Snowflake you need to export data out or pipe it over JDBC/ODBC to an external tool. With Databricks you can use SQL for data warehousing, and when you need to, you can work with that same data using Python to train an ML model without piping data out over JDBC (using the Spark engine). One security model, one dataset, multiple use cases (AI/ML/BI/SQL) on one platform.
They're still lacking things in the SQL space. For example, Databricks say they're ACID compliant, but it's only on a single-table basis. Snowflake offers multi-table ACID consistency, which is something that you would expect by default in the data warehousing world. If I'm loading, say, 10 tables in parallel, I want to be able to roll back or commit the complete set of transactions in order to maintain data consistency. I'm sure you could work around this limitation, but it would feel like a hack, especially if you're coming from a traditional DWH world (Teradata, Netezza, etc.).
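For anyone unsure what "roll back or commit the complete set" looks like in practice, here is a minimal sketch using Python's built-in `sqlite3` purely as a stand-in. It says nothing about how Snowflake or Databricks behave; the table names and the `load_all_or_nothing` helper are hypothetical.

```python
import sqlite3
from typing import Dict, List, Tuple


def load_all_or_nothing(
    conn: sqlite3.Connection, batches: Dict[str, List[Tuple[int, str]]]
) -> bool:
    """Insert rows into several tables inside ONE transaction.

    Table names come from trusted code here; never interpolate
    user-supplied names into SQL like this.
    """
    try:
        with conn:  # commits if the block succeeds, rolls back on exception
            for table, rows in batches.items():
                conn.executemany(
                    f"INSERT INTO {table} (id, value) VALUES (?, ?)", rows
                )
        return True
    except sqlite3.Error:
        # Any failure undoes the inserts into every table, not just one.
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (id INTEGER PRIMARY KEY, value TEXT)")
conn.execute("CREATE TABLE fact_sales (id INTEGER PRIMARY KEY, value TEXT)")
conn.commit()

# The duplicate primary key in fact_sales fails the load, and the
# dim_customer insert from the same transaction is rolled back with it.
load_all_or_nothing(
    conn,
    {"dim_customer": [(1, "alice")], "fact_sales": [(1, "x"), (1, "dup")]},
)
```

With single-table atomicity only, the `dim_customer` rows would already be committed by the time the `fact_sales` load failed, and cleaning that up would be the job of the ETL layer.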

Snowflake now offers Scala, Java and Python support, so it would seem their capabilities are converging even more, but both with their own strengths due to their respective histories.

Actually, you would expect that in an OLTP world. For the longest time, DW vendors, even Oracle, have recommended disabling transactions to get better performance. The logic is implemented in the ETL layer. Very rarely do you need multi-table transactions in a large-scale DW.

Snowpark is still inferior.

I have not, but I do not see why it would be much slower than direct access to the storage. Databases are quite good at streaming rows.
> I do not see why it would be much slower than direct access to the storage.

Implementations of protocols like ODBC/JDBC generally use their own custom on-wire binary protocols, and the data must be marshalled to and from the client library, so performance varies a lot from one implementation to another. We are seeing a lot of improvements in this space though, especially with the adoption of Arrow.
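A rough illustration of the marshalling point, using only the Python standard library. This is not real ODBC/JDBC or Arrow code: the row layout (`<id`, a 4-byte int followed by an 8-byte float) and all function names are made up. The row-wise path unpacks every row individually, while the columnar path moves each column as one contiguous buffer, which is the general idea behind columnar formats such as Arrow.

```python
import struct
from array import array
from typing import List, Tuple

# Hypothetical row-wise wire layout: int32 id followed by float64 value.
ROW = struct.Struct("<id")


def encode_rows(ids: List[int], values: List[float]) -> bytes:
    """Row-wise: every row is packed on its own."""
    return b"".join(ROW.pack(i, v) for i, v in zip(ids, values))


def decode_rows(buf: bytes) -> List[Tuple[int, float]]:
    """Row-wise: one unpack call per row on the client side."""
    return [ROW.unpack_from(buf, off) for off in range(0, len(buf), ROW.size)]


def encode_columns(ids: List[int], values: List[float]) -> Tuple[bytes, bytes]:
    """Columnar: each column becomes a single contiguous buffer."""
    return array("i", ids).tobytes(), array("d", values).tobytes()


def decode_columns(id_buf: bytes, val_buf: bytes) -> Tuple[array, array]:
    """Columnar: each column is loaded with one bulk copy, no per-row work."""
    ids = array("i")
    ids.frombytes(id_buf)
    values = array("d")
    values.frombytes(val_buf)
    return ids, values
```

Both paths round-trip the same data; the difference is how much per-row work the client library has to do, which is where implementations tend to diverge.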

There is also the question of computing for ML. Data scientists today use several tools/frameworks ranging from scikit-learn/XGBoost to PyTorch/Keras/TensorFlow - to name a few. Enabling data scientists to use these frameworks against near-realtime data without worrying about provisioning infrastructure or managing dependencies or adding an additional export-to-cloud-storage hop is a game changer IMO.

> There is also the question of computing for ML.

A few reasons why the Databricks platform shines here:

1) Not limited to just UDFs: extensions to improve performance, including GPU acceleration in XGBoost and distributed deep learning using HorovodRunner.

2) End-to-end MLOps solution, including Feature Store, Model Registry, and Model Serving.

3) Open approach with https://www.mlflow.org/

4) Glass-box (not black-box) model for AutoML.