| Some additional context: * Companies are querying thousands / tens of thousands of Parquet files stored in the cloud via Spark * Parquet lakes can be partitioned which works well for queries that filter on the partitioned key (and slows down queries that don't filter on the partition key) * Parquet files contain min/max metadata for all columns. When possible, entire files are skipped, but this is relatively rare. This is called predicate pushdown filtering. * Parquet files allow for the addition of custom metadata, but Spark doesn't let users use the custom metadata when filtering * Spark is generally bad at joining two big tables (it's good at broadcast joins, generally when one of the tables is 2GB or less) * Companies like Snowflake & Memsql have Spark connectors that let certain parts of queries get pushed down. There is a huge opportunity to build a massive company on data lakes optimized for Spark. The amount of wasted compute cycles filtering over files that don't have any data relevant to the query is staggering. |
I was listening to the A16Z podcast and they were discussing this in depth.