
by MrPowers 883 days ago

Yea, Spark works best with "right-sized" files.

Let's suppose you have a data lake with 40,000 Parquet files. You need to list the files before you can read the data. This can take a few minutes. I've worked on data lakes where file listing operations ran for hours. Key/value stores aren't as good at listing files as Unix filesystems are.

When Spark reads the 40,000 Parquet files, it needs to figure out the schema. By default, it'll grab the schema from one of the files and assume that all the others have the same schema. That assumption could be wrong.

You can set an option (mergeSchema) telling Spark to read the schemas of all 40,000 Parquet files and merge them into one, failing if they conflict. That's expensive.

Or you can manually specify the schema, but that can be really tedious. What if the table has 200 columns?
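For anyone who wants to see the three options side by side, here's a rough PySpark sketch. The helper names and the column dict are made up for illustration, `spark` is assumed to be an existing SparkSession, and the DDL builder assumes plain column names that need no quoting.

```python
def ddl_schema(columns):
    """Build a DDL schema string such as 'id BIGINT, name STRING'.

    `columns` is a hypothetical {name: spark_sql_type} dict; names are
    assumed to be simple identifiers that need no backtick quoting.
    """
    return ", ".join(f"{name} {dtype}" for name, dtype in columns.items())


def read_default(spark, path):
    # With schema merging off (the default), Spark does not check every
    # file's footer; it infers the schema from a subset of the files.
    return spark.read.parquet(path)


def read_merged_schema(spark, path):
    # mergeSchema makes Spark read the footers of all files and merge
    # them, which is the expensive option.
    return spark.read.option("mergeSchema", "true").parquet(path)


def read_explicit_schema(spark, path, columns):
    # DataFrameReader.schema accepts a DDL string, so no inference runs.
    return spark.read.schema(ddl_schema(columns)).parquet(path)
```

With 200 columns you'd probably generate that dict from somewhere else rather than type it out, which is kind of the point about keeping the schema in table metadata.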

The schema in the Parquet footer is perfect for a single file. I think storing the schema in the table metadata is much better when data is spread across many Parquet files.


adolph 883 days ago

> a data lake with 40,000 Parquet files. You need to list the files before you can read the data. This can take a few minutes.

Sounds like this data lake could use a Parquet file listing the Parquet files.



MrPowers 883 days ago

Yea, that's exactly what Delta Lake does. All the table metadata is stored in Parquet (it's initially written as JSON files, which are eventually compacted into Parquet files). These tables are sometimes so huge that the table metadata is itself big data.
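To make the idea concrete, here's a toy sketch of replaying a JSON commit log to get the current file list without listing the data directory. It's heavily simplified and is not Delta Lake's actual reader: it only handles "add" and "remove" actions, ignores the Parquet checkpoints, and `write_commit` and the part file names are invented for the demo.

```python
import json
import tempfile
from pathlib import Path


def list_live_files(log_dir):
    """Replay commit files in version order; return the files still in the table."""
    live = set()
    # Zero-padded names mean lexicographic order matches version order.
    for commit in sorted(Path(log_dir).glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)  # one JSON action per line
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return sorted(live)


def write_commit(log_dir, version, actions):
    # Hypothetical helper: writes one commit as newline-delimited JSON.
    name = f"{version:020d}.json"
    Path(log_dir, name).write_text("\n".join(json.dumps(a) for a in actions))


with tempfile.TemporaryDirectory() as log_dir:
    write_commit(log_dir, 0, [{"add": {"path": "part-0000.parquet"}},
                              {"add": {"path": "part-0001.parquet"}}])
    # A later commit rewrites part-0000 into part-0002.
    write_commit(log_dir, 1, [{"remove": {"path": "part-0000.parquet"}},
                              {"add": {"path": "part-0002.parquet"}}])
    files = list_live_files(log_dir)
    print(files)  # ['part-0001.parquet', 'part-0002.parquet']
```

The reader only has to look at the small log directory, not the 40,000 data files, and the same log is a natural place to keep the schema.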
