Hacker News
by fr0styMatt2 3937 days ago
For those 'hard to reproduce' issues, it's about knowing what debugging and forensics tools are out there and making sure the systems you deploy support postmortem debugging and issue analysis. That means things like creating crash dump files and giving your field technicians or customers an easy way to get that data to you along with an incident report (or having the system send it automatically, where applicable). It also means logging that can be left switched on all the time and reconfigured dynamically, so there's less chance of the damn thing being off when an issue does occur.
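To make that concrete, here's a rough Python sketch of both ideas: a crash traceback written to a file on fatal signals, and a logger that stays on at a quiet level and can be turned up at runtime. The names `install_crash_dump`, `make_logger`, `set_verbosity` and `fielded_app` are made up for illustration. `faulthandler` only dumps Python tracebacks, not a full core dump, so treat it as a stand-in for whatever crash-dump mechanism your platform provides.

```python
import faulthandler
import io
import logging


def install_crash_dump(path):
    """Append Python tracebacks to `path` on fatal signals (SIGSEGV, SIGABRT, ...)."""
    # The file must stay open for as long as faulthandler is enabled,
    # so return it and let the caller keep a reference.
    dump_file = open(path, "a")
    faulthandler.enable(file=dump_file, all_threads=True)
    return dump_file


def make_logger(stream, name="fielded_app"):
    """Create a logger that is always on, but quiet (WARNING) by default."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.WARNING)
    logger.propagate = False  # keep output on our handler only
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def set_verbosity(logger, level_name):
    """Change the level at runtime, e.g. from an admin command or a config reload."""
    logger.setLevel(level_name.upper())


if __name__ == "__main__":
    buf = io.StringIO()
    log = make_logger(buf)
    log.debug("dropped: below the default level")
    set_verbosity(log, "debug")  # no restart needed
    log.debug("kept: level was raised while running")
    print(buf.getvalue(), end="")
```

How `set_verbosity` gets triggered (signal, admin endpoint, watched config file) depends on the system; the point is that logging never has to be switched off and back on.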

It also helps to be aware of tools that analyze code for race conditions or try to provoke them.
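As a toy illustration of what "provoking" a race can look like, the sketch below forces the bad interleaving on purpose: a barrier makes every thread read the shared counter before any of them writes it back, so the lost update happens every time instead of once in a blue moon. This is a hand-rolled example, not one of the actual tools; the function names are made up.

```python
import threading


def run_unsafe_increment(workers=4):
    """Each thread does an unsynchronized read-modify-write on a shared counter."""
    counter = {"value": 0}
    barrier = threading.Barrier(workers)

    def worker():
        snapshot = counter["value"]      # read
        barrier.wait()                   # hold until every thread has read
        counter["value"] = snapshot + 1  # write back a stale value

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]


def run_locked_increment(workers=4):
    """Same work, but the read-modify-write is protected by a lock."""
    counter = {"value": 0}
    lock = threading.Lock()

    def worker():
        with lock:
            counter["value"] = counter["value"] + 1

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]


if __name__ == "__main__":
    print("unsafe:", run_unsafe_increment())
    print("locked:", run_locked_increment())
```

Real tools do this without you having to know where the race is, but the principle is the same: widen the window so the failure stops being rare.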