| HN Mirror


by rfoo 482 days ago

They did this back in their trading firm days, and...

Imagine you have a sequence of numbers. You want to randomly select a window of, say, 1024 consecutive numbers as one input sequence for your model. Now say the sequence has n items and you want to sample n/c windows in total (c is a constant and << 1024). How do you do a fixed shuffle?

The key is that the windows we want to read overlap. If we brute-force the fixed shuffle and expand every window, we store n/c windows of 1024 items each, which is 1024/c times the original data.
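To make the storage argument concrete, here is a minimal sketch, not anyone's actual pipeline. It assumes window starts are drawn uniformly at random from a seeded RNG; the names `sample_window_starts`, `read_window`, and `materialized_items` are made up for illustration. The point is that you can keep only the n/c start offsets and slice the original sequence at read time, rather than storing every expanded window.

```python
import random


def sample_window_starts(n, window, c, seed):
    """Return n // c window start offsets from a seeded (fixed) RNG.

    Assumption: starts are drawn uniformly at random; the original post
    does not say how the windows are chosen.
    """
    rng = random.Random(seed)
    num_windows = n // c
    # Every start must leave room for a full window inside the sequence.
    return [rng.randrange(0, n - window + 1) for _ in range(num_windows)]


def read_window(data, start, window):
    """Random read: slice one window out of the original sequence."""
    return data[start:start + window]


def materialized_items(n, window, c):
    """Items stored if every sampled window is expanded and saved."""
    return (n // c) * window


if __name__ == "__main__":
    n, window, c = 10_000, 1024, 16
    data = list(range(n))
    starts = sample_window_starts(n, window, c, seed=0)
    first = read_window(data, starts[0], window)
    print(len(starts), len(first), materialized_items(n, window, c) // n)
```

With n = 10,000, c = 16, and a window of 1024, materializing stores 625 × 1024 = 640,000 items, 64 times the original, while the offsets-only approach stores 625 integers and pays for it with random reads.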

This isn't useful for LLMs, but hey, wonder how it started?

1 comments

dekhn 481 days ago

I guess I'm much more of the "materialize the shuffle asynchronously from the training loop" kind of person. I agree, the materialization storage cost is very high, but that's normally been a cost I've been willing to accept.

As an ML infra guy I have had to debug a lot of failing jobs over the years, and randomizing datapipes are among the hardest things to debug. Sometimes there will be a "record-of-death" that randomly gets shuffled into a batch, but only causes problems when it is (extremely rarely) coupled with a few other records.

I guess I'll just have to update my priors and accept that inline synchronous randomization with random reads is a useful-enough access pattern in HPC that it should be optimized for. Certainly a lot more work and complexity, hence my question of just how necessary it is.

rfoo 481 days ago

Yeah, I don't want to do this either. This is a super special case; after exploring alternatives with our researchers, it's unfortunately needed. As for record-of-death, we made sure that we serialize all RNG state and that our data pipeline is perfectly reproducible even when starting from a checkpoint.
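For readers wondering what "serialize all RNG state" can look like, here is a rough sketch using Python's standard `random` module. The `ShuffledReader` class and its `state_dict` / `load_state_dict` methods are hypothetical names for illustration, not the pipeline described above. The idea is that restoring the saved state makes a resumed reader produce exactly the same window offsets as the original run would have.

```python
import pickle
import random


class ShuffledReader:
    """Hypothetical reader that picks random window starts reproducibly."""

    def __init__(self, n, window, seed):
        self.n = n
        self.window = window
        self.rng = random.Random(seed)
        self.step = 0  # number of windows handed out so far

    def next_start(self):
        self.step += 1
        return self.rng.randrange(0, self.n - self.window + 1)

    def state_dict(self):
        # random.Random.getstate() returns a picklable tuple.
        return pickle.dumps({"step": self.step, "rng": self.rng.getstate()})

    def load_state_dict(self, blob):
        state = pickle.loads(blob)
        self.step = state["step"]
        self.rng.setstate(state["rng"])


if __name__ == "__main__":
    reader = ShuffledReader(n=10_000, window=1024, seed=7)
    for _ in range(5):
        reader.next_start()
    checkpoint = reader.state_dict()  # saved alongside the model checkpoint

    resumed = ShuffledReader(n=10_000, window=1024, seed=0)
    resumed.load_state_dict(checkpoint)
    # Both readers now continue with identical offsets.
    print(reader.next_start() == resumed.next_start())
```

With this in place, a suspected record-of-death batch can be replayed by reloading the checkpoint taken before it and stepping forward to the same position.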

Building a system for serving read-only data at NVMe SSD speed (as in IOPS) took surprisingly little effort, and is mostly enough for training data. Kudos to DeepSeek who decided to spend extra effort to build a full PFS and share it.