by platypii 409 days ago
Yeah, except with Parquet you don't need to load the entire file: the Parquet metadata lets you make HTTP range requests for just the data you need.

For example, this Parquet file is the entire English Wikipedia (400 MB), but the page loads less than 4 MB, including the HTML and all the JS, to display the first rows:

https://hyperparam.app/files?key=https%3A%2F%2Fs3.hyperparam...

This way you can have huge AI datasets in cloud storage, and still have a nice interface for looking at your data.

In particular, a lot of modern AI datasets are huge walls of text (web scrapes, chains of thought, or agentic conversation histories), and most datasets on Hugging Face are in Parquet. So you can look at your data much more quickly this way than in, say, Jupyter notebooks.

Here's the Glaive reasoning dataset on the Hyperparam Hugging Face Space:

https://huggingface.co/spaces/hyperparam/hyperparam?url=http...

1 comments

Wow - that's super clever. How do you get away with loading part of the file? Which part do you load?

I’m not OP, but as this is a common pattern…

Parquet stores its metadata in the footer, so the first request is effectively a negative byte range (starting at content length minus footer length and running to the end of the file). This metadata includes table statistics like “column ‘date_sold’ has minimum date 1-1-1970 and maximum date 12-31-2024,” and row group statistics like “the row group at byte offset X has a minimum ‘date_sold’ value of 1-1-2023 and a maximum of 1-1-2024.”
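
A minimal Python sketch of that footer read, for anyone who wants to see the byte math. It relies on the Parquet file layout, where the file ends with a 4-byte little-endian footer length followed by the 4-byte magic "PAR1". The names fetch_range, footer_span, read_footer and range_header are hypothetical helpers made up for this example, and the "remote file" is just an in-memory byte string with a fake footer payload, so treat it as an illustration rather than how any particular reader does it:

```python
import struct

MAGIC = b"PAR1"  # 4-byte marker at the start and end of a Parquet file
TAIL_LEN = 8     # 4-byte little-endian footer length + 4-byte magic


def range_header(start, end):
    """HTTP Range header value for bytes [start, end), end exclusive."""
    return f"bytes={start}-{end - 1}"


def footer_span(file_size, tail):
    """Return (start, end) of the footer metadata, given the file's last 8 bytes."""
    if len(tail) != TAIL_LEN or tail[4:] != MAGIC:
        raise ValueError("missing PAR1 magic; not a Parquet file")
    (footer_len,) = struct.unpack("<I", tail[:4])
    end = file_size - TAIL_LEN
    start = end - footer_len
    if start < len(MAGIC):
        raise ValueError("footer length is larger than the file")
    return start, end


def read_footer(fetch_range, file_size):
    """Fetch the raw (still Thrift-encoded) footer with two ranged reads."""
    tail = fetch_range(file_size - TAIL_LEN, file_size)
    start, end = footer_span(file_size, tail)
    return fetch_range(start, end)


# Stand-in for a remote file: fake column data plus a fake footer payload.
footer = b"fake-thrift-metadata"
blob = MAGIC + b"\x00" * 1000 + footer + struct.pack("<I", len(footer)) + MAGIC


def fetch(start, end):
    # Hypothetical transport: over HTTP this would be a GET with the
    # header "Range: " + range_header(start, end).
    return blob[start:end]


raw = read_footer(fetch, len(blob))
```

In practice the file size would come from something like a HEAD request's Content-Length, or the first read could use a suffix range such as "bytes=-8" so the size does not need to be known up front. Many readers also over-fetch the tail in one request to avoid the second round trip; that is an optimization this sketch skips.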

So if your query tool gets a SQL query with a predicate like “WHERE date_sold > ‘3-1-2024’ AND date_sold < ‘3-30-2024’”, it can use “partition pruning” to fetch only the row group of the Parquet file that includes the March 2024 data.
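
The pruning check itself is just an interval-overlap test on the min/max statistics. Here is a rough Python sketch under stated assumptions: RowGroupStats and prune are hypothetical names, the offsets and dates are made-up sample values in the spirit of the example above, and real engines such as DataFusion work from the decoded footer metadata and handle many more predicate shapes.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RowGroupStats:
    byte_offset: int      # where the row group starts in the file
    min_date_sold: date   # smallest date_sold value in the row group
    max_date_sold: date   # largest date_sold value in the row group


def prune(row_groups, after, before):
    """Keep row groups that could hold a row with after < date_sold < before."""
    return [
        rg for rg in row_groups
        if rg.max_date_sold > after and rg.min_date_sold < before
    ]


# Made-up statistics for three row groups.
groups = [
    RowGroupStats(4, date(1970, 1, 1), date(2022, 12, 31)),
    RowGroupStats(50_000, date(2023, 1, 1), date(2024, 1, 1)),
    RowGroupStats(90_000, date(2024, 1, 2), date(2024, 12, 31)),
]

# WHERE date_sold > '3-1-2024' AND date_sold < '3-30-2024'
kept = prune(groups, date(2024, 3, 1), date(2024, 3, 30))
offsets_to_fetch = [rg.byte_offset for rg in kept]
```

Only the last row group survives here, so only its byte range has to be requested. Note that the statistics can only rule row groups out: a surviving row group may still contain no matching rows, so the filter is still applied after fetching.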

My colleague Artjoms (who co-founded Splitgraph with me) gave a great presentation [0] on how we achieved this with DataFusion, including a visualization of the pruning.

[0] https://youtube.com/watch?v=D_phetiS-4w