by abeppu
409 days ago
> Duckdb and datafusion are super cool! But they are VERY large wasm blobs (30-40mb each). This is often larger than the data you’re trying to load.
I don't know how to reconcile this with the emphasis in the page on interacting with datasets relevant to AI, which are commonly several orders of magnitude larger than this. What's an AI problem where the data involved has been less than tens of MB? I think that only toy problems and datasets could plausibly be smaller (e.g. the training images for the classic MNIST dataset are 47 MB, and the whole dataset is 55 MB https://www.kaggle.com/datasets/hojjatk/mnist-dataset?select... ).
For example, this Parquet file is the entire English Wikipedia (400 MB), but the page loads less than 4 MB, including the HTML and all the JS, to display the first rows:
https://hyperparam.app/files?key=https%3A%2F%2Fs3.hyperparam...
This way you can have huge AI datasets in cloud storage, and still have a nice interface for looking at your data.
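To make the "400 MB file, under 4 MB loaded" point concrete, here is a rough sketch of how a client can find a Parquet file's metadata without downloading the whole file. It assumes the viewer uses HTTP range requests, which the comment above does not actually state. It relies only on the Parquet layout: a file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`. The helper names `footer_range` and `range_header` are made up for illustration and are not part of any library.

```python
import struct

PARQUET_MAGIC = b"PAR1"  # magic bytes at the start and end of every Parquet file


def footer_range(file_size: int, tail: bytes) -> tuple[int, int]:
    """Return the [start, end) byte range of the footer metadata.

    `tail` is the last 8 bytes of the file: a 4-byte little-endian
    footer length followed by the 4-byte magic. (Hypothetical helper.)
    """
    if len(tail) != 8 or tail[4:] != PARQUET_MAGIC:
        raise ValueError("tail does not look like the end of a Parquet file")
    (footer_len,) = struct.unpack("<I", tail[:4])
    end = file_size - 8          # footer stops right before length + magic
    start = end - footer_len
    if start < 4:                # the first 4 bytes are the leading magic
        raise ValueError("footer length is larger than the file allows")
    return start, end


def range_header(start: int, end: int) -> str:
    """Format a half-open [start, end) range as an HTTP Range header value."""
    return f"bytes={start}-{end - 1}"  # HTTP ranges are inclusive on both ends


# Illustrative numbers only: a 400 MB file with a 1 MB footer.
size = 400_000_000
tail = struct.pack("<I", 1_000_000) + PARQUET_MAGIC
start, end = footer_range(size, tail)
print(range_header(start, end))
```

With the footer in hand, a reader knows the byte offsets of each row group and column chunk, so it can issue further range requests for just the chunks needed to render the first rows.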
In particular, a lot of modern AI datasets are huge walls of text (web scrapes, chains of thought, or agentic conversation histories), and most datasets on Hugging Face are in Parquet. So you can look at your data much more quickly this way than with, say, Jupyter notebooks.
Here's the Glaive reasoning dataset on the Hyperparam Hugging Face space:
https://huggingface.co/spaces/hyperparam/hyperparam?url=http...