by kevindeasis 1681 days ago
They are a data warehouse with analytics? So data warehouse as a service in the cloud?

So they can collect data from different places like SQL, images, etc. I think a better question would be: what type of data can't they ingest?

Once you have your data, I guess you can run some analytics to find out what your data tells you.

2 comments

A data lake can be home to many different data formats, e.g. Parquet, Avro, Thrift, protobuf, ORC, HDF5, CSV, and JSON, all co-existing. Spark lets you create a virtual abstraction over all of this and query it as though it were a homogeneous database. There's no need to import data into a centralized format and schema.
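To make the "one query over many formats" idea concrete, here is a toy sketch in plain Python rather than Spark. It assumes two made-up in-memory "files" (one CSV, one JSON lines) and hypothetical helper names (`read_csv`, `read_jsonl`, `unified_view`); the point is only that the schema is applied at read time, and nothing is converted into a central format first.

```python
import csv
import io
import json

# Two hypothetical "files" in different formats, kept in memory for the sketch.
CSV_DATA = "user,amount\nalice,10\nbob,5\n"
JSONL_DATA = '{"user": "alice", "amount": 7}\n{"user": "carol", "amount": 3}\n'


def read_csv(text):
    """Yield one dict per CSV row, coercing amount to int at read time."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"user": row["user"], "amount": int(row["amount"])}


def read_jsonl(text):
    """Yield one dict per non-empty JSON line, using the same row shape."""
    for line in text.splitlines():
        if line.strip():
            rec = json.loads(line)
            yield {"user": rec["user"], "amount": int(rec["amount"])}


def unified_view(sources):
    """Chain the readers so callers see one stream of rows, whatever the format."""
    for reader, text in sources:
        yield from reader(text)


def total_by_user(rows):
    """A simple 'query': sum amount per user across every source."""
    totals = {}
    for row in rows:
        totals[row["user"]] = totals.get(row["user"], 0) + row["amount"]
    return totals


totals = total_by_user(
    unified_view([(read_csv, CSV_DATA), (read_jsonl, JSONL_DATA)])
)
print(totals)
```

A real engine adds distribution, predicate pushdown, and columnar readers on top, but the shape is the same: per-format readers behind a common row interface, with the query written once against that interface.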

This really all ties back to the "old" Hadoop days, and is an evolution of running compute over data that is not in a fixed, managed format or schema.

I'd like to add some points: I've used Snowflake for several years. Snowflake works with structured and semi-structured data (think spreadsheets and JSON). I've never tried working with pics or videos - and I'm not sure it would make sense to do that.
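For anyone unsure what "semi-structured" means in practice, here is a small plain-Python sketch. It is not Snowflake's API; the `flatten` helper and the sample record are made up for illustration. It takes a nested JSON record and turns it into a flat row with dotted column names, which is roughly the kind of reshaping a warehouse has to support for JSON to be queryable like a table.

```python
import json


def flatten(record, prefix=""):
    """Flatten nested dicts into one dict with dotted keys; lists are kept as-is."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the column prefix.
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Hypothetical sample record: nested objects plus an array.
raw = '{"id": 1, "user": {"name": "alice", "plan": {"tier": "pro"}}, "tags": ["a", "b"]}'
row = flatten(json.loads(raw))
print(row)
```

Records with different nesting simply produce different column sets, which is why this kind of data does not fit a fixed spreadsheet-style schema up front.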

I've evaluated Databricks. It works with the above-mentioned structured and semi-structured data. I also suspect it could process unstructured data. My understanding is that it runs Python (and some other languages), so you can do any "Python stuff, but in the cloud, and on 1000s of computers".

Databricks used to be an Apache Spark-as-a-service company, and Spark is a predominantly Scala code base. PySpark, which is popular in ML circles, is just a Python binding for the real engine. In the last couple of years the Databricks platform migrated from open-source Spark to a new proprietary engine written in C++.
You're referring to PySpark, which still does all the heavy lifting in the JVM.