by wmf 294 days ago
If your software can handle machine failures, 20% extra performance is absolutely worth some extra failures.
1 comment

I think this is the best path if your problem can support it.

I use a 5950X for running genetic programming and neuroevolution experiments and about once every 100 hours the machine will just not like the state/load it is experiencing and will restart. My approach is to checkpoint as often as possible. I restart the program the next morning and it deserializes the last snapshot from disk. Worst case, I lose 5 minutes of work.

This also helps with Windows updates, power outages, and EM/cosmic radiation.
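
For anyone who wants to try this, here is a rough sketch of the checkpoint-and-resume loop in Python. It is not my actual code: the file name, the 5-minute interval, the `state` dict layout, and the `step` callback are all placeholders I made up for illustration. The one detail worth copying is writing to a temp file and renaming it over the old snapshot, so a restart in the middle of a save should not leave you with a half-written checkpoint.

```python
import os
import pickle
import tempfile
import time

CHECKPOINT_PATH = "checkpoint.pkl"  # hypothetical file name
CHECKPOINT_INTERVAL_S = 300         # assumed: save roughly every 5 minutes


def save_checkpoint(state, path=CHECKPOINT_PATH):
    """Write the snapshot to a temp file, flush it to disk, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # os.replace swaps the file in one step, so readers see either
        # the old snapshot or the new one, never a partial write.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path=CHECKPOINT_PATH):
    """Return the last saved state, or None if no snapshot exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return None


def run(total_generations, step, path=CHECKPOINT_PATH,
        interval_s=CHECKPOINT_INTERVAL_S):
    """Resume from the last snapshot if there is one, then keep evolving."""
    state = load_checkpoint(path)
    if state is None:
        state = {"generation": 0, "population": []}  # placeholder layout
    last_save = time.monotonic()
    while state["generation"] < total_generations:
        state = step(state)  # one unit of work, supplied by the caller
        if time.monotonic() - last_save >= interval_s:
            save_checkpoint(state, path)
            last_save = time.monotonic()
    save_checkpoint(state, path)  # final snapshot when the run completes
    return state
```

After an unexpected restart you just call `run(...)` again with the same path and it picks up from the last snapshot. Pickle is only a convenient stand-in here; any serializer works as long as the whole experiment state (including RNG state, if you care about reproducibility) goes into the snapshot.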