For what use case? Image data storage? For text storage, Parquet is good enough today. PyTorch's DataLoader and tf.data provide clients that read ahead with parallel workers and fill an in-memory buffer, which is then transferred to the GPUs. I agree that S3 can be a bottleneck here. That's why we have HopsFS as a global, distributed, coherent NVMe cache over S3. Anyscale have been doing something similar with a local NVMe cache for S3. Another interesting file format is Lance. It's like Parquet, but for image data: it has an additional index for fast random I/O within a file (to find individual images).
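To make the read-ahead point concrete, here's a rough sketch of the pattern using only the Python standard library. It is not how PyTorch or tf.data implement it internally; `fetch`, `prefetch`, `num_workers` and `buffer_size` are names I made up for illustration, and `fetch` is just a stand-in for a slow object-store read.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def fetch(key):
    # Hypothetical stand-in for a slow read from object storage (e.g. an S3 GET).
    return f"payload-for-{key}".encode()


def prefetch(keys, fetch_fn=fetch, num_workers=4, buffer_size=8):
    """Yield fetch_fn(key) for each key, in order, while reading ahead."""
    buf = queue.Queue(maxsize=buffer_size)  # bounded in-memory buffer
    done = object()  # sentinel marking the end of the stream

    def producer():
        with ThreadPoolExecutor(max_workers=num_workers) as pool:
            for key in keys:
                # put() blocks once the buffer is full, which caps how far
                # ahead of the consumer the reads are allowed to run.
                buf.put(pool.submit(fetch_fn, key))
        buf.put(done)

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buf.get()
        if item is done:
            return
        # Blocks only if this particular read has not finished yet.
        yield item.result()


if __name__ == "__main__":
    for blob in prefetch(range(5)):
        print(len(blob))
```

The consumer (the training loop) only stalls when the buffer is empty, so storage latency is hidden as long as the workers keep up. In the real libraries the comparable knobs are `num_workers` and `prefetch_factor` on PyTorch's `DataLoader`, and `Dataset.prefetch` in tf.data.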