by jamesblonde 906 days ago
For what use case? Image data storage? For text storage, Parquet is good enough today. PyTorch's DataLoader and tf.data provide parallel clients that read ahead and fill an in-memory buffer, which is then transferred to and from the GPUs. I agree that S3 can be a bottleneck here. That's why we have HopsFS as a global, distributed, coherent NVMe cache over S3. Anyscale have been doing something similar with a local NVMe cache for S3. Another interesting file format is Lance - it's like Parquet, but for image data. It has an additional index for fast random I/O within a file (to find images).
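
To make the read-ahead idea concrete, here is a rough stdlib-only sketch of parallel readers filling a bounded in-memory buffer. It is not how PyTorch or tf.data implement it. The names `prefetch`, `read_fn`, `num_workers` and `buffer_size` are made up for illustration, and `read_fn` stands in for whatever actually fetches an object from S3 or an NVMe cache.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def prefetch(read_fn, keys, num_workers=4, buffer_size=8):
    """Yield read_fn(key) for each key, in order, reading ahead in parallel.

    Hypothetical helper: roughly `buffer_size` reads are in flight or
    buffered at any time, served by `num_workers` threads.
    """
    buf = queue.Queue(maxsize=buffer_size)  # bounded in-memory buffer
    done = object()  # sentinel marking the end of the stream

    def producer(pool):
        for key in keys:
            # put() blocks once the buffer is full, which caps the read-ahead.
            buf.put(pool.submit(read_fn, key))
        buf.put(done)

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        threading.Thread(target=producer, args=(pool,), daemon=True).start()
        while True:
            item = buf.get()
            if item is done:
                break
            # Blocks only if this particular read has not finished yet.
            yield item.result()
```

You would consume it like `for blob in prefetch(read_object, keys): ...`, with `read_object` being your own fetch function. The sketch assumes the consumer drains the whole stream; it does not clean up if you stop iterating early or a read raises. If the consumer (the GPU transfer) is faster than the readers can fill the buffer, storage is your bottleneck, which is the S3 case above.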

We are trying to saturate storage -> PCIe -> GPU cards for tasks like GPU-accelerated log analytics, and this is increasingly the bottleneck.