by lmeyerov 906 days ago
We have been excited to dig into the Iceberg era of more managed Parquet storage, but these systems are still years behind on supporting fast GPU I/O (GPUDirect/cuFile). So every time we look at bringing them to a customer for powering AI workloads, we hit that wall.

It seems inevitable, more a question of when than if. Being able to have our cake and eat it too will be very cool :)

1 comments

For what use case? Image data storage? For text storage, Parquet is good enough today. PyTorch's DataLoader and TensorFlow's tf.data provide multi-threaded clients that read ahead in parallel and fill an in-memory buffer, which is then transferred to and from the GPUs. I agree that S3 can be a bottleneck here. That's why we have HopsFS as a global, distributed, coherent NVMe cache over S3. Anyscale have been doing something similar with a local NVMe cache for S3. Another interesting file format is Lance: it's like Parquet, but for image data. It has an additional index for fast random I/O within a file (to find images).
We are trying to saturate the storage -> PCIe -> GPU path for tasks like GPU-accelerated log analytics, and this is increasingly the bottleneck.
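
For anyone following along, the read-ahead pattern mentioned in the reply above (worker threads loading in parallel into a bounded in-memory buffer) can be sketched roughly like this. It uses only the Python standard library and is not how PyTorch or TensorFlow actually implement it; `read_ahead`, `load`, `num_workers`, and `buffer_size` are hypothetical names picked for illustration, and the host-to-GPU copy is left out entirely.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def read_ahead(paths, load, num_workers=4, buffer_size=8):
    """Yield load(path) for each path, in order, while loading ahead.

    Illustrative sketch only: `load` is any caller-supplied function
    (e.g. one that reads a Parquet row group); it is an assumption,
    not part of any library API.
    """
    buf = queue.Queue(maxsize=buffer_size)  # bounded buffer caps memory use
    done = object()                         # sentinel marking end of input

    def producer():
        with ThreadPoolExecutor(max_workers=num_workers) as pool:
            for path in paths:
                # submit() starts the load in a worker thread;
                # put() blocks when the buffer is full, which throttles
                # how far ahead of the consumer we read.
                buf.put(pool.submit(load, path))
        buf.put(done)

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buf.get()
        if item is done:
            return
        # Futures are consumed in submission order, so output order
        # matches input order even if loads finish out of order.
        yield item.result()


if __name__ == "__main__":
    # Stand-in for real file reads: just upper-case the "path".
    for chunk in read_ahead(["a.parquet", "b.parquet"], str.upper):
        print(chunk)
```

The bounded queue is the key design choice: the consumer (the step that would feed the GPU) never waits on a cold read as long as the workers keep the buffer full, but every byte still passes through host memory, which is the hop GPUDirect/cuFile is meant to skip.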