Hacker News new | ask | show | jobs
by mtmail 3126 days ago
A cluster of 10+ machines was processing data for a whole week (7 days). Think Hadoop with counting and grouping numbers. The output was slightly over 2GB. The user specified his home directory as output. The home directory was NFS-mounted and didn't support files over 2GB, thus no file was written and the job had to be rerun.
1 comments

There’s a similar right of passage for Deep Learning engineers using python and Keras - every single one of us has been burned by waiting for a week for a model to train, only to find “h5py” is not installed, exception is thrown and all work lost. (That’s the python module used for persistence of the model weights)

Then the young grasshoppers learn about checkpointing and using our dev ops systems to make sure the environment is up to spec, but I feel this has to happen at least once to each Deep Learning researcher

Something very similar happened to me. To this day, I don't run any python script without the `-i` flag