by saj1th
1680 days ago
> I do not see why it would be much slower than direct access to the storage.

Drivers for interfaces like ODBC/JDBC generally implement their own custom on-wire binary protocols, and the data must be marshalled to/from the lib - so performance varies a lot from one implementation to another. We are seeing a lot of improvements in this space though, especially with the adoption of Arrow.

There is also the question of compute for ML. Data scientists today use several tools/frameworks, ranging from scikit-learn/XGBoost to PyTorch/Keras/TensorFlow, to name a few. Enabling data scientists to use these frameworks against near-realtime data without worrying about provisioning infrastructure, managing dependencies, or adding an extra export-to-cloud-storage hop is a game changer IMO.
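To make the marshalling point concrete, here is a rough stdlib-only sketch. It is not how any particular ODBC/JDBC driver or Arrow itself works; the row layout (`ROW`) and all function names are made up for illustration. It only shows the general difference: a row-oriented wire format forces the client to unpack every record and rebuild columns, while a columnar layout can be loaded one whole buffer per column.

```python
import struct
from array import array

# Hypothetical row layout: little-endian int64 id followed by float64 value.
ROW = struct.Struct("<qd")


def encode_rows(ids, values):
    """Row-oriented wire format: one packed record per row."""
    return b"".join(ROW.pack(i, v) for i, v in zip(ids, values))


def decode_rows(payload):
    """The client unpacks every record and rebuilds the columns itself."""
    ids, values = [], []
    for i, v in ROW.iter_unpack(payload):
        ids.append(i)
        values.append(v)
    return ids, values


def encode_columns(ids, values):
    """Columnar format: one contiguous buffer per column (native byte order)."""
    return array("q", ids).tobytes(), array("d", values).tobytes()


def decode_columns(id_buf, value_buf):
    """Each column is loaded from its buffer in a single call."""
    ids, values = array("q"), array("d")
    ids.frombytes(id_buf)
    values.frombytes(value_buf)
    return ids, values


if __name__ == "__main__":
    ids, values = [1, 2, 3], [0.5, 1.5, 2.5]
    print(decode_rows(encode_rows(ids, values)))
    print(decode_columns(*encode_columns(ids, values)))
```

Real columnar formats also handle nulls, nested types, schemas and so on, so treat this as a toy showing where the per-row work goes, not as a benchmark.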

A few reasons why the Databricks platform shines here:
1) Not limited to just UDFs - extensions to improve performance, including GPU acceleration in XGBoost and distributed deep learning using HorovodRunner.
2) End-to-end MLOps solution - including Feature Store, Model Registry & Model Serving.
3) Open approach with https://www.mlflow.org/
4) Glass box (not black box) model for AutoML.