| HN Mirror

	by tomnipotent 1685 days ago
	A data lake can be home to many different data formats, e.g. Parquet, Avro, Thrift, protobuf, ORC, HDF5, CSV, and JSON, all co-existing. Spark lets you create a virtual abstraction over all of this and query it as though it were a homogeneous database. There's no need to import the data into a centralized format and schema first. This all ties back to the "old" Hadoop days: it's an evolution of running compute over data that isn't in a fixed, managed format or schema.
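
To make the "query mixed formats in place" idea concrete, here is a rough sketch in plain Python (standard library only, not Spark). It treats a CSV file and a JSON-lines file as one stream of rows and aggregates over both without first converting them to a common storage format. The file names, the `user`/`amount` columns, and the helpers `scan`, `scan_all`, `total_by_user`, and `demo` are all made up for illustration.

```python
import csv
import json
import tempfile
from pathlib import Path


def scan(path):
    """Yield rows as dicts from one file, choosing a reader by extension."""
    path = Path(path)
    if path.suffix == ".csv":
        with path.open(newline="") as f:
            for row in csv.DictReader(f):
                # CSV carries no types, so the schema is applied while reading.
                yield {"user": row["user"], "amount": float(row["amount"])}
    elif path.suffix == ".jsonl":
        with path.open() as f:
            for line in f:
                rec = json.loads(line)
                yield {"user": rec["user"], "amount": float(rec["amount"])}
    else:
        raise ValueError(f"no reader registered for {path.suffix}")


def scan_all(paths):
    """Present several files in different formats as one stream of rows."""
    for p in paths:
        yield from scan(p)


def total_by_user(rows):
    """Roughly: SELECT user, SUM(amount) FROM rows GROUP BY user."""
    totals = {}
    for row in rows:
        totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]
    return totals


def demo():
    # Hypothetical sample data written to a temp dir so the sketch is runnable.
    with tempfile.TemporaryDirectory() as d:
        csv_path = Path(d) / "orders.csv"
        csv_path.write_text("user,amount\nalice,10.5\nbob,3\n")
        json_path = Path(d) / "orders.jsonl"
        json_path.write_text(
            '{"user": "alice", "amount": 4.5}\n{"user": "carol", "amount": 7}\n'
        )
        return total_by_user(scan_all([csv_path, json_path]))


if __name__ == "__main__":
    print(demo())
```

In Spark the same shape would be, roughly, loading each source with `spark.read.csv(...)` or `spark.read.json(...)`, registering the resulting DataFrames with `createOrReplaceTempView(...)`, and querying them together through `spark.sql(...)`. The toy above only shows the idea of applying a schema at read time; it does none of the distributed work.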