by dekhn
481 days ago
I guess I'm much more of the "materialize the shuffle asynchronously from the training loop" kind of person. I agree, the materialization storage cost is very high, but that's normally been a cost I've been willing to accept. As an ML infra guy I have had to debug a lot of failing jobs over the years, and randomizing datapipes are among the hardest things to debug. Sometimes there will be a "record-of-death" that randomly gets shuffled into a batch, but only causes problems when it is (extremely rarely) coupled with a few other records. I guess I'll just have to update my priors and accept that inline synchronous randomization with random reads is a useful-enough access pattern in HPC that it should be optimized for. It's certainly a lot more work and complexity, hence my question of just how necessary it is.
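To make the debugging argument concrete, here is a minimal sketch of a shuffle that is materialized ahead of the training loop. Everything in it is hypothetical: the function names `materialize_shuffle`, `batches`, and `find_batch`, the seed-mixing constant, and the assumption that records are addressed by integer index. It is not taken from any particular framework.

```python
import random


def materialize_shuffle(num_records, seed, epoch):
    """Return a reproducible permutation of record indices for one epoch.

    The order depends only on (seed, epoch), so it can be written out ahead
    of time and regenerated later when a job fails. The multiplier is an
    arbitrary illustrative constant used to mix seed and epoch.
    """
    rng = random.Random(seed * 1_000_003 + epoch)
    order = list(range(num_records))
    rng.shuffle(order)
    return order


def batches(order, batch_size):
    """Yield consecutive batches of record indices from a materialized order."""
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]


def find_batch(order, batch_size, record_id):
    """Return the index of the batch that a given record landed in."""
    position = order.index(record_id)
    return position // batch_size
```

Because the order is fixed before training starts, a suspected "record-of-death" can be traced with `find_batch` to the exact batch and the neighbours it was grouped with, which is much harder when the shuffle happens inline with random reads.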
|
Building a system for serving read-only data at NVMe SSD speed (as in IOPS) took surprisingly little effort, and is mostly enough for training data. Kudos to DeepSeek, who decided to spend the extra effort to build a full PFS and share it.