by abeppu 409 days ago
> Duckdb and datafusion are super cool! But they are VERY large wasm blobs (30-40mb each). This is often larger than the data you’re trying to load.

I don't know how to reconcile this with the emphasis in the page on interacting with datasets relevant to AI, which are commonly several orders of magnitude larger than this. What's an AI problem where the data involved has been less than 10s of MB? I think that only toy problems and datasets could plausibly be smaller (e.g. the training images for the classic MNIST dataset are 47MB, and the whole dataset is 55MB https://www.kaggle.com/datasets/hojjatk/mnist-dataset?select... ).


Yea, except with Parquet you don't need to load the entire file: the Parquet metadata lets you do HTTP range requests for just the data you need.

For example, this Parquet file is the entire English Wikipedia (400MB), but the page loads less than 4MB, including HTML and all JS, to display the first rows:

https://hyperparam.app/files?key=https%3A%2F%2Fs3.hyperparam...

This way you can have huge AI datasets in cloud storage, and still have a nice interface for looking at your data.

In particular, a lot of modern AI datasets are huge walls of text (web scrapes, chains of thought, or agentic conversation histories), and most datasets on Hugging Face are in Parquet. So you can look at your data much more quickly this way than in, say, Jupyter notebooks.

Here's the Glaive reasoning dataset on the Hyperparam Hugging Face Space:

https://huggingface.co/spaces/hyperparam/hyperparam?url=http...

Wow - that's super clever. How do you get away with loading part of the file? Which part do you load?

I’m not OP but as this is a common pattern…

Parquet stores the metadata in the footer, so the first request is effectively a negative byte range (content length minus footer length). This metadata includes table statistics like “column ‘date_sold’ has minimum date 1-1-1970 and maximum date 12-31-2024,” and row group statistics like “the row group at byte offset X has minimum ‘date_sold’ value of 1-1-2023 and maximum ‘1-1-2024’.”
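
To make the footer fetch concrete, here is a rough sketch of the byte arithmetic, not taken from any particular tool. It relies on the Parquet file layout, where the file ends with a 4-byte little-endian metadata length followed by the magic bytes "PAR1". The function and constant names are made up for illustration, and the sketch assumes the client already knows the total file size (for instance from the Content-Range header of the first response).

```python
import struct

PARQUET_MAGIC = b"PAR1"
TAIL_SIZE = 8  # 4-byte little-endian metadata length + 4-byte magic

# Suffix range: asks the server for only the last 8 bytes of the file.
TAIL_RANGE_HEADER = f"bytes=-{TAIL_SIZE}"


def footer_byte_range(tail: bytes, file_size: int) -> tuple[int, int]:
    """Return the inclusive (start, end) byte range of the metadata footer.

    `tail` is the last 8 bytes of the file; `file_size` is the total
    length of the file in bytes (hypothetical inputs for this sketch).
    """
    if len(tail) != TAIL_SIZE or tail[4:] != PARQUET_MAGIC:
        raise ValueError("tail does not look like the end of a Parquet file")
    (metadata_len,) = struct.unpack("<I", tail[:4])
    end = file_size - TAIL_SIZE - 1      # last metadata byte, inclusive
    start = end - metadata_len + 1       # first metadata byte
    if start < len(PARQUET_MAGIC):       # file also starts with "PAR1"
        raise ValueError("metadata length is larger than the file allows")
    return start, end


def range_header(start: int, end: int) -> str:
    """Format an inclusive byte range as an HTTP Range header value."""
    return f"bytes={start}-{end}"
```

In practice a client can save a round trip by guessing a larger suffix (say the last few hundred KB) and only issuing a second request if the footer turns out to be bigger than the guess.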

So if your query tool gets a SQL query with a predicate like “WHERE date_sold > ‘3-1-2024’ AND date_sold < ‘3-30-2024’” then it can use “partition pruning” to fetch only the RowGroup of the parquet file that includes the March 2024 data.
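
The pruning step itself is just an interval-overlap check against the min/max statistics. Below is a minimal sketch, assuming the footer has already been decoded into per-row-group stats; the `RowGroupStats` class and `prune_row_groups` function are hypothetical names for illustration, not DataFusion's API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RowGroupStats:
    """Hypothetical, simplified view of one row group's footer entry."""
    offset: int      # byte offset of the row group in the file
    length: int      # size of the row group in bytes
    min_value: date  # minimum 'date_sold' in this row group
    max_value: date  # maximum 'date_sold' in this row group


def prune_row_groups(groups, lo: date, hi: date):
    """Keep row groups that could hold a value v with lo < v < hi.

    A row group is skipped only when its [min, max] interval cannot
    overlap the open interval (lo, hi); kept groups may still contain
    non-matching rows, which the query engine filters after fetching.
    """
    return [g for g in groups if g.max_value > lo and g.min_value < hi]


def to_range_headers(groups):
    """Turn the surviving row groups into HTTP Range header values."""
    return [f"bytes={g.offset}-{g.offset + g.length - 1}" for g in groups]


# Example: three row groups, one per period (made-up offsets and sizes).
groups = [
    RowGroupStats(4, 1000, date(2024, 1, 1), date(2024, 2, 29)),
    RowGroupStats(1004, 500, date(2024, 3, 1), date(2024, 3, 31)),
    RowGroupStats(1504, 800, date(2024, 4, 1), date(2024, 4, 30)),
]
kept = prune_row_groups(groups, date(2024, 3, 1), date(2024, 3, 30))
```

With the predicate from the example query, only the middle row group survives, so only its byte range needs to be fetched.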

My colleague Artjoms (and co-founder of Splitgraph with me) gave a great presentation [0] on how we achieved this with DataFusion, including visualization of the pruning.

[0] https://youtube.com/watch?v=D_phetiS-4w