by platypii
409 days ago
Yeah, except with Parquet you don't need to load the entire file: the Parquet metadata lets you make HTTP range requests for just the data you need. For example, this Parquet file is the entire English Wikipedia (400 MB), but the page loads less than 4 MB, including the HTML and all the JS, to display the first rows: https://hyperparam.app/files?key=https%3A%2F%2Fs3.hyperparam...

This way you can keep huge AI datasets in cloud storage and still have a nice interface for looking at your data. In particular, a lot of modern AI datasets are huge walls of text (web scrapes, chains of thought, or agentic conversation histories), and most datasets on Hugging Face are in Parquet. So you can look at your data much more quickly this way than with, say, Jupyter notebooks.

Here's the glaive reasoning dataset on the Hyperparam Hugging Face space: https://huggingface.co/spaces/hyperparam/hyperparam?url=http...
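To make the range-request idea concrete, here is a minimal Python sketch of the first step a reader could take: working out which bytes hold the file metadata. It relies on the Parquet layout, where a file ends with the metadata, a 4-byte little-endian metadata length, and the magic bytes `PAR1`. The function names `tail_range_header` and `footer_range` are made up for illustration, this is not Hyperparam's actual code, and encrypted footers are ignored.

```python
import struct

PARQUET_MAGIC = b"PAR1"
TAIL_SIZE = 8  # 4-byte metadata length + 4-byte magic


def tail_range_header(file_size: int) -> str:
    """Range header value that fetches only the last 8 bytes of the file."""
    start = file_size - TAIL_SIZE
    return f"bytes={start}-{file_size - 1}"


def footer_range(file_size: int, tail: bytes) -> tuple[int, int]:
    """Given the last 8 bytes, return the inclusive byte range of the metadata."""
    if len(tail) != TAIL_SIZE or tail[4:] != PARQUET_MAGIC:
        raise ValueError("tail does not end with the PAR1 magic bytes")
    # Metadata length is stored as an unsigned 32-bit little-endian integer.
    (metadata_len,) = struct.unpack("<I", tail[:4])
    start = file_size - TAIL_SIZE - metadata_len
    if start < len(PARQUET_MAGIC):
        # The first 4 bytes of the file are the leading magic, not metadata.
        raise ValueError("metadata length is larger than the file allows")
    return start, file_size - TAIL_SIZE - 1


# Hypothetical 1000-byte file whose metadata is 100 bytes long.
tail = struct.pack("<I", 100) + PARQUET_MAGIC
print(tail_range_header(1000))   # bytes=992-999
print(footer_range(1000, tail))  # (892, 991)
```

A client would send the first value as a `Range` header, parse the 8 bytes that come back, then request the second range to get the metadata. That metadata lists the row groups and column chunk offsets, so the client can fetch only the chunks needed for the rows on screen.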