by kburman
186 days ago
- Modern DL frameworks (PyTorch DataLoader, WebDataset, NVIDIA DALI) do not require random access to disk. They stream large sequential shards into a RAM buffer and shuffle within that buffer. As long as the buffer size is significantly larger than the batch size, the statistical convergence of the model is identical to perfect random sampling.
- AI training is a bandwidth problem, not a latency problem. GPUs need to be fed at 10GB/s+. Making millions of small HTTP requests introduces massive overhead (headers, SSL handshakes, TTFB) that kills bandwidth. Even if the storage engine has 0ms latency, the network stack does not.
- If you truly need "arbitrary subsetting" without downloading a whole tarball, formats like Parquet or indexed TFRecords allow HTTP Range Requests. You can fetch specific byte ranges from a large blob without "coupling" the storage layout significantly.
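To make the first point concrete, here is a minimal sketch of a shuffle buffer over a sequential stream. It is not the implementation used by any of the frameworks named above; the function name `buffered_shuffle` and its parameters are made up for illustration.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_shuffle(stream: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Approximate a global shuffle while reading the input strictly in order.

    Hypothetical helper: keeps at most `buffer_size` items in memory.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buf: List[T] = []
    for item in stream:
        if len(buf) < buffer_size:
            # Still filling the buffer: nothing is emitted yet.
            buf.append(item)
            continue
        # Buffer is full: emit a random resident item, reuse its slot for the new one.
        idx = rng.randrange(buffer_size)
        yield buf[idx]
        buf[idx] = item
    # Input exhausted: drain the remaining items in random order.
    rng.shuffle(buf)
    yield from buf


if __name__ == "__main__":
    # Sequential "shard" of 20 sample ids, shuffled through an 8-slot buffer.
    print(list(buffered_shuffle(range(20), buffer_size=8)))
```

The storage layer only ever sees a sequential read. How close the output gets to a true global shuffle depends on how large the buffer is relative to the ordering structure in the shards.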
AI training is a bandwidth problem, not a latency problem. GPUs need to be fed at 10GB/s+. Making millions of small HTTP requests introduces massive overhead (headers, SSL handshakes, TTFB) that kills bandwidth. Even if the storage engine has 0ms latency, the network stack does not.
Agreed that throughput is more of an issue than latency, since you can queue data in CPU memory. Small-object throughput is definitely an issue though, which is what I was talking about. Also, there's no need to use HTTP for your requests, so HTTP and TLS overheads are self-inflicted problems of the storage system itself.
You can fetch specific byte ranges from a large blob without "coupling" the storage layout significantly.
This has the exact same throughput problems as small objects, though.
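A back-of-envelope sketch of why: in a simple model where every request pays a fixed overhead before any bytes flow, a range read and a small-object read of the same size cost the same. The function name, the 1 ms overhead, and the 10 GB/s link below are illustrative assumptions, not measurements of any real system.

```python
def effective_throughput(payload_bytes: float,
                         overhead_s: float,
                         bandwidth_bytes_per_s: float,
                         in_flight: int = 1) -> float:
    """Bytes per second delivered under a fixed per-request overhead.

    Toy model (assumption): each request costs `overhead_s` plus its
    transfer time, `in_flight` requests overlap perfectly, and the link
    bandwidth is a hard cap. Per-request CPU cost is ignored.
    """
    transfer_s = payload_bytes / bandwidth_bytes_per_s
    per_request_s = overhead_s + transfer_s
    rate = in_flight * payload_bytes / per_request_s
    return min(rate, bandwidth_bytes_per_s)


if __name__ == "__main__":
    link = 10e9        # assumed 10 GB/s link
    overhead = 1e-3    # assumed 1 ms fixed cost per request
    for size in (100e3, 10e6, 1e9):  # 100 KB, 10 MB, 1 GB reads
        gbps = effective_throughput(size, overhead, link) / 1e9
        print(f"{size / 1e6:>8.1f} MB per request -> {gbps:.2f} GB/s")
```

With these assumed numbers, one 100 KB read at a time yields roughly 0.1 GB/s, while 1 GB reads sit close to the full link rate. Whether the 100 KB comes from a standalone object or from a byte range inside a large blob makes no difference to the model.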