by MrPowers
883 days ago
Yeah, Spark works best with "right-sized" files. Let's suppose you have a data lake with 40,000 Parquet files. You need to list the files before you can read the data, and that alone can take a few minutes. I've worked on data lakes where file listing operations ran for hours. Key/value object stores aren't as good at listing files as Unix filesystems are. When Spark reads the 40,000 Parquet files, it needs to figure out the schema. By default, it'll just grab the schema from one of the files and assume that all the others have the same schema. This could be wrong. You can set an option telling Spark to read the schemas of all 40,000 Parquet files and make sure they all have the same schema. That's expensive. Or you can manually specify the schema, but that can be really tedious. What if the table has 200 columns? The schema in the Parquet footer is perfect for a single file. I think storing the schema in table-level metadata is much better when data is spread across many Parquet files.
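To make the trade-off concrete, here's a rough toy model in plain Python. It is not Spark code: the `footers` dict stands in for the Parquet footers, and the three function names are made up for illustration. In Spark itself, the corresponding knobs are the `mergeSchema` read option and passing a schema explicitly via `.schema(...)` on the reader.

```python
from typing import Dict, List, Tuple

# A schema is modelled as a tuple of (column name, type name) pairs.
Schema = Tuple[Tuple[str, str], ...]


def schema_from_one_file(footers: Dict[str, Schema]) -> Tuple[Schema, int]:
    """Trust the first file's footer. Cost: one footer read."""
    first_path = next(iter(footers))
    return footers[first_path], 1


def schema_from_all_files(
    footers: Dict[str, Schema],
) -> Tuple[Schema, int, List[str]]:
    """Read every footer and list the files that disagree with the first one."""
    items = iter(footers.items())
    _, expected = next(items)
    mismatched = [path for path, schema in items if schema != expected]
    return expected, len(footers), mismatched


def schema_from_metadata(table_schema: Schema) -> Tuple[Schema, int]:
    """Schema stored once at the table level. Cost: zero footer reads."""
    return table_schema, 0


# Hypothetical data lake: 40,000 files, one of which has a drifted schema.
base: Schema = (("id", "int64"), ("name", "string"))
drifted: Schema = (("id", "string"), ("name", "string"))
footers = {f"part-{i:05d}.parquet": base for i in range(40_000)}
footers["part-39999.parquet"] = drifted

_, reads_one = schema_from_one_file(footers)
_, reads_all, bad_files = schema_from_all_files(footers)
_, reads_meta = schema_from_metadata(base)

print(reads_one, reads_all, reads_meta)  # footer reads per strategy
print(bad_files)  # files the "one file" strategy would silently get wrong
```

The cheap strategy never notices the drifted file, the thorough one pays for 40,000 footer reads to find it, and the table-level schema pays for neither.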
Sounds like this data lake could use a Parquet file listing the Parquet files.
Butter