by riku_iki 2681 days ago
I doubt 1.5B params will fit on any single GPU. I think they split the model across multiple GPUs/TPUs, similar to mesh-tensorflow: https://arxiv.org/abs/1811.02084
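
For a rough sense of scale, here is a back-of-envelope sketch of the memory for 1.5B params. The assumptions are mine, not from any paper: fp32 values (4 bytes each) and an Adam-style optimizer that keeps weights, gradients, and two moment buffers. Activations are ignored, so the real training footprint would be higher. The function name `param_memory_gb` is just illustrative.

```python
def param_memory_gb(n_params, bytes_per_value=4, copies=1):
    """Memory in GB (1e9 bytes) for `copies` full-size buffers of parameters.

    Assumption: every buffer stores one value of `bytes_per_value` bytes
    per parameter (fp32 by default).
    """
    return n_params * bytes_per_value * copies / 1e9


N_PARAMS = 1.5e9  # the 1.5B figure from the comment

# Weights only, e.g. for inference.
weights_gb = param_memory_gb(N_PARAMS, copies=1)

# Assumed Adam-style training state: weights + gradients + 2 moment buffers.
train_state_gb = param_memory_gb(N_PARAMS, copies=4)

print(f"weights only:   {weights_gb:.1f} GB")
print(f"training state: {train_state_gb:.1f} GB (before activations)")
```

So under these assumptions the weights alone are modest, but the training state plus activations is what pushes you toward splitting the model across devices.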