by tomnipotent 1685 days ago
A data lake can be home to many different data formats (e.g. Parquet, Avro, Thrift, Protobuf, ORC, HDF5, CSV, JSON), all co-existing. Spark lets you create a virtual abstraction over all of this and query it as though it were a homogeneous database. There's no need to import the data into a centralized format and schema.
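
To make the "virtual abstraction" idea concrete, here is a toy sketch in plain Python. It is not Spark and not how Spark is implemented; it only illustrates the same schema-on-read pattern. The file names, the `user`/`amount` columns, and the helpers (`scan`, `total_by_user`, `build_demo_lake`) are all made up for the example. Each file stays in its native format, and a per-format reader exposes its rows on demand as one logical table.

```python
import csv
import json
import tempfile
from pathlib import Path


def read_csv(path):
    # Parse rows lazily, straight from the CSV file; nothing is imported elsewhere.
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield {"user": row["user"], "amount": float(row["amount"])}


def read_jsonl(path):
    # Same logical row shape, but sourced from newline-delimited JSON.
    with open(path) as f:
        for line in f:
            if line.strip():
                rec = json.loads(line)
                yield {"user": rec["user"], "amount": float(rec["amount"])}


# Hypothetical registry: file extension -> reader. Real engines support far more formats.
READERS = {".csv": read_csv, ".jsonl": read_jsonl}


def scan(lake_dir):
    """Present every supported file under lake_dir as one logical table."""
    for path in sorted(Path(lake_dir).iterdir()):
        reader = READERS.get(path.suffix)
        if reader:
            yield from reader(path)


def total_by_user(lake_dir):
    """A 'query' over the unified view: sum of amount grouped by user."""
    totals = {}
    for row in scan(lake_dir):
        totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]
    return totals


def build_demo_lake():
    """Create a throwaway directory holding two files in different formats."""
    lake = Path(tempfile.mkdtemp())
    (lake / "orders.csv").write_text("user,amount\nalice,10\nbob,5\n")
    (lake / "events.jsonl").write_text(
        '{"user": "alice", "amount": 2.5}\n{"user": "carol", "amount": 1}\n'
    )
    return lake


if __name__ == "__main__":
    print(total_by_user(build_demo_lake()))
```

The query code never needs to know which format a row came from, which is roughly the property Spark gives you at scale with its own readers and SQL layer.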

This all ties back to the "old" Hadoop days: it's an evolution of running compute over data that isn't in a fixed, managed format or schema.