by rsanders 2775 days ago
I believe that a Parquet file meeting certain criteria can be read in parallel as multiple Spark partitions without any shuffling, with the splits falling on Parquet row group boundaries.

See https://stackoverflow.com/questions/27194333/how-to-split-pa..., https://parquet.apache.org/documentation/latest/, etc.
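
To make the row group point concrete, here is a rough sketch in plain Python, not Spark code. The function name `assign_row_groups` and all the sizes are made up for illustration. The rule it uses (a row group goes to whichever split contains its midpoint byte) is my understanding of how the Hadoop Parquet input format behaves, so treat it as an assumption rather than a spec.

```python
from typing import List, Tuple


def assign_row_groups(
    row_groups: List[Tuple[int, int]],  # (start_offset, length) per row group
    split_size: int,                    # bytes per input split (e.g. HDFS block size)
    file_size: int,                     # total file size in bytes
) -> List[List[int]]:
    """Hypothetical helper: map each row group to exactly one split.

    Assumption: a row group is read by the split whose byte range
    contains the row group's midpoint, so no row group is cut in half.
    """
    n_splits = max(1, -(-file_size // split_size))  # ceiling division
    splits: List[List[int]] = [[] for _ in range(n_splits)]
    for idx, (start, length) in enumerate(row_groups):
        midpoint = start + length // 2
        split_idx = min(midpoint // split_size, n_splits - 1)
        splits[split_idx].append(idx)
    return splits


# Three 100-unit row groups in a 310-unit file, split every 128 units.
print(assign_row_groups([(4, 100), (104, 100), (204, 100)], 128, 310))
# -> [[0], [1, 2], []]

# One row group per file, one block per file: a single split, a single task.
print(assign_row_groups([(4, 120)], 128, 128))
# -> [[0]]
```

Under this assumption the partitions can be unbalanced (one split above ends up empty), which is one reason to keep row group size and block size aligned.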

Whether it's better to have multiple Parquet files or a single parallelizable Parquet file depends on your environment and application. At my company, we've tended to have a single row group per file (and one HDFS block per file), partly for historical reasons.