by gyrovagueGeist 798 days ago
The interconnect and network topology are also a big part of the hardware where you can't "fake The Real Thing" in practice. You can often get fairly confident in program correctness on toy problems by scaling from 1 to ~40 ranks on your local machine, but you can't tell much about performance until you run on a real distributed system, where you can see how much your communication pattern stresses the cluster.

Or if you run into bugs / crashes that need 1000s of processes or a full-scale problem instance to reproduce, god help you and your SLURM queue times.
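
On the performance point, a rough way to see why a 40-rank local run says little about behavior at scale is the textbook latency-bandwidth (alpha-beta) cost model for a ring allreduce. This is only a sketch: the function names are hypothetical, the latency and bandwidth values are made-up placeholders rather than measurements, and the model ignores topology and contention entirely, which is exactly the part only a real cluster will show you.

```python
def ring_allreduce_seconds(ranks, message_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimate ring allreduce time with the alpha-beta model.

    A ring allreduce takes 2 * (ranks - 1) steps, and each step moves a
    chunk of message_bytes / ranks. Topology and contention are ignored.
    """
    if ranks < 2:
        return 0.0  # a single rank has nothing to communicate
    steps = 2 * (ranks - 1)
    chunk_bytes = message_bytes / ranks
    return steps * (latency_s + chunk_bytes / bandwidth_bytes_per_s)


def latency_fraction(ranks, message_bytes, latency_s, bandwidth_bytes_per_s):
    """Fraction of the modeled time spent on per-message latency."""
    total = ring_allreduce_seconds(
        ranks, message_bytes, latency_s, bandwidth_bytes_per_s
    )
    if total == 0.0:
        return 0.0
    return 2 * (ranks - 1) * latency_s / total


if __name__ == "__main__":
    # Placeholder values (assumptions, not measurements):
    # 1 MB message, 2 microsecond latency, 10 GB/s link bandwidth.
    message_bytes = 1e6
    latency_s = 2e-6
    bandwidth = 1e10
    for ranks in (4, 40, 512, 4096):
        t = ring_allreduce_seconds(ranks, message_bytes, latency_s, bandwidth)
        f = latency_fraction(ranks, message_bytes, latency_s, bandwidth)
        print(f"{ranks:5d} ranks: {t * 1e3:8.3f} ms modeled, {f:6.1%} latency")
```

Even in this idealized model the cost shifts from bandwidth-bound to latency-bound as the rank count grows, so timings from a laptop-sized run don't extrapolate. Real interconnects add congestion and placement effects on top of that.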