|
|
|
|
|
by deliciousturkey
184 days ago
|
|
In AI training, you want to sample the dataset in arbitrary fashion. You may want to arbitrarily subset your dataset for specific jobs. These are fundamentally opposed demands compared to linear access: To make your tar-file approach work, the data has to ordered to match the sample order of your training workload, coupling data storage and sampler design. There are solutions for this, but the added complexity is big. In any case, your training code and data storage become tightly coupled. If you can avoid it by having a faster storage solution, at least I would be highly appreciative of it. |
|
- AI training is a bandwidth problem, not a latency problem. GPUs need to be fed at 10GB/s+. Making millions of small HTTP requests introduces massive overhead (headers, SSL handshakes, TTFB) that kills bandwidth. Even if the storage engine has 0ms latency, the network stack does not.
- If you truly need "arbitrary subsetting" without downloading a whole tarball, formats like Parquet or indexed TFRecords allow HTTP Range Requests. You can fetch specific byte ranges from a large blob without "coupling" the storage layout significantly.