|
|
|
|
|
by mtmail
3126 days ago
|
|
A cluster of 10+ machines was processing data for a whole week (7 days). Think Hadoop with counting and grouping numbers. The output was slightly over 2GB. The user specified his home directory as output. The home directory was NFS-mounted and didn't support files over 2GB, thus no file was written and the job had to be rerun. |
|
Then the young grasshoppers learn about checkpointing and using our dev ops systems to make sure the environment is up to spec, but I feel this has to happen at least once to each Deep Learning researcher