| I am leading a similar initiative and have also used Databricks for the preprocessing. The most interesting part is what happens between preprocessing and model training: the hand-off to the cluster workers. I guess the efficient option is to partition the data, set up shards in advance, and ideally cache or even copy the data to the local workers during init. This, of course, breaks some of the promise of scaling training flexibly, for instance when experimenting with how compute and data scale. A different way to go about it is a streaming/iterable dataset/loader implementation with its own sharding logic that reads from a central store of Parquet files with a reasonable row-group size. This gives full flexibility in terms of node/GPU/worker/batch_size for experimentation, e.g. literally as parameters in PyTorch. Of course, since the data is kept centrally, one also has to implement caching of remote data. In my opinion, there is no satisfying, flexible off-the-shelf solution for this, especially when one also wants to experiment with complex transformations or augmentations in the dataset/loader and remain portable across cloud offerings. So this has to be implemented from scratch (not too difficult, but still a lot of code). The upcoming datapipes will probably make this trivial as well. Would love to hear more experiences of how you set this up! Edit:
I guess for NLP this is a good implementation, and it is what Mosaic uses:
https://huggingface.co/docs/datasets/stream |
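
To make the "own sharding logic" idea a bit more concrete, here is a minimal sketch of how row groups could be split across nodes and loader workers. It is only an illustration under assumptions: the function names (`list_row_groups`, `shard_for_worker`) and the file names are hypothetical, and the strided assignment is just one simple scheme, not necessarily what any particular library does. In a real PyTorch setup, `rank`/`world_size` would typically come from `torch.distributed` and `worker_id`/`num_workers` from `torch.utils.data.get_worker_info()` inside an `IterableDataset.__iter__`.

```python
from typing import List, Sequence, Tuple

# One unit of work: (parquet file path, row-group index within that file).
RowGroup = Tuple[str, int]


def list_row_groups(files: Sequence[Tuple[str, int]]) -> List[RowGroup]:
    """Expand (path, num_row_groups) pairs into one entry per row group.

    In practice the row-group counts would be read from the Parquet
    metadata of each file; here they are passed in directly (assumption).
    """
    return [(path, i) for path, n in files for i in range(n)]


def shard_for_worker(
    units: Sequence[RowGroup],
    rank: int,
    world_size: int,
    worker_id: int,
    num_workers: int,
) -> List[RowGroup]:
    """Return the row groups owned by one loader worker on one process.

    Every (rank, worker_id) pair gets a unique global id, and the units
    are dealt out round-robin, so the shards are disjoint and together
    cover all units for any choice of world_size and num_workers.
    """
    global_id = rank * num_workers + worker_id
    total_workers = world_size * num_workers
    return list(units[global_id::total_workers])


# Hypothetical store: two files with 3 and 4 row groups.
units = list_row_groups([("a.parquet", 3), ("b.parquet", 4)])

# 2 processes x 2 loader workers each; changing these numbers needs no re-sharding.
shards = [
    shard_for_worker(units, rank, 2, worker_id, 2)
    for rank in range(2)
    for worker_id in range(2)
]
```

Because the assignment is computed at iteration time from plain integers, node count, GPUs per node, and loader workers stay ordinary experiment parameters. The downside is the one mentioned above: each worker reads from the central store, so a local cache for fetched row groups still has to be added on top.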