|
|
|
|
|
by jamesblonde
906 days ago
|
|
For what use case? Image data storage?
For text storage, Parquet is good enough today.
PyTorch Data Loader and TF Data provide multi-threaded clients that read ahead in parallel and fill up an in-memory buffer that is then transferred in/out from GPUs. I agree that S3 can be a bottleneck here. That's why we have HopsFS as a global distributed coherent NVMe cache over S3. Anyscale have been doing something similar with a local NVMe cache for S3.
Another interesting file format is Lance - it's like Parquet, but for image data. It has an additional index for fast random I/O within a file (to find images). |
|