|
What about auto-saving recovery data? It really depends on the language and environment used. I work with C (almost legacy code at this point), and if the program generates a segfault, there is no way to safely store any data (for all I know, it could have been trying to auto-save recovery data when it happened). About the best I can hope for is that the bug shows itself during testing, but hey, things slip into production. The last time that happened, in an asynchronous, event-driven C program, the programmer maintaining the code violated an unstated assumption made by the initial developer (who was no longer with the company), and the program went boom in production. At that point, the program is automatically restarted, and I get to pore through a core dump to figure out the problem.

I'm not a fan of defensive programming, as it can hide an obvious bug for a long time. I consider it a Good Thing that the program crashed; otherwise we might have gone months, or even years, without noticing the actual bug.

Logging is an art. Too little, and it's hard to diagnose. Too much, and it's hard to slog through. There's also the possibility that you don't log the right information. I've had to go back and amend logging statements when something didn't parse right (okay, what are our customers sending us now? Oh nice! The logs don't show the data that didn't parse. The things you don't think about when coding).

And then there are the monumental screw-ups whose consequences no one foresaw. Again, at work, we receive messages on service S, which transforms and forwards each request to service T, which queries service E. T also sends continuous queries (a fixed query we aren't charged for [1]) to E to make sure it's up. Someone, somewhere, removed the fixed query from E. When the fixed query to E returned "not found," the code in T was written in such a way that it failed to distinguish "not found" from "timed out" (because that fixed query should never have been deleted, right?), and thus T shut down (because it had nothing to query), which in turn shut down S (because it had nothing to send the data to), which in turn meant many people were called ...

Then there was the routing error which caused our network traffic to be three times higher than expected and misrouted UDP replies ...

Error handling and reporting is hard. Maybe not cache invalidation and naming things hard, but hard nonetheless.

[1] Enterprise system here. |
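For what it's worth, the S/T/E outage comes down to one missing distinction: a "not found" reply still proves E is alive, while a timeout does not. Here is a minimal sketch of keeping the two apart in C. Every name in it (`probe_result`, `probe_action`, `decide`) is made up for illustration and none of it is the actual code in T; it also assumes that shutting down on a timeout is the intended behavior, which the story implies but does not spell out.

```c
#include <stdio.h>

/* Hypothetical outcomes of the fixed health-check query sent from T to E. */
typedef enum {
    PROBE_OK,         /* E answered with the expected record */
    PROBE_NOT_FOUND,  /* E answered, but the fixed record is gone */
    PROBE_TIMED_OUT   /* E did not answer before the deadline */
} probe_result;

/* Hypothetical actions T can take after a probe. */
typedef enum {
    ACTION_CONTINUE,            /* keep serving */
    ACTION_CONTINUE_AND_ALERT,  /* E is alive; the probe itself is broken */
    ACTION_SHUT_DOWN            /* E looks unreachable */
} probe_action;

/* Map a probe outcome to an action, logging every non-OK case. */
probe_action decide(probe_result r)
{
    switch (r) {
    case PROBE_OK:
        return ACTION_CONTINUE;
    case PROBE_NOT_FOUND:
        /* Any reply at all means E is up, so do not treat this as an outage. */
        fprintf(stderr, "probe: fixed query returned not-found; E is up, check the probe record\n");
        return ACTION_CONTINUE_AND_ALERT;
    case PROBE_TIMED_OUT:
        fprintf(stderr, "probe: no reply from E before the deadline\n");
        return ACTION_SHUT_DOWN;
    }
    /* A value outside the enum is a bug: log it and take the conservative path. */
    fprintf(stderr, "probe: unexpected result %d\n", (int)r);
    return ACTION_SHUT_DOWN;
}
```

With something like this, a deleted probe record would page someone about the probe instead of taking T (and then S) down.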
Not when you do it the right way! You should only mitigate unexpected situations if you also log them, monitor them, and handle them with an error callback, etc.
Also see my other comment in this thread: https://news.ycombinator.com/item?id=12871541
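As a rough illustration of "mitigate, but also log, monitor, and call back," here is a small C sketch. The function `parse_port_or_default`, the `error_cb` type, and the `fallback_count` counter are all hypothetical names invented for this example, and the port-parsing scenario is an assumption, not something from the thread. It falls back to a default on bad input, but it logs the raw input, bumps a counter that monitoring could scrape, and notifies the caller.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical callback type: invoked whenever a fallback is taken. */
typedef void (*error_cb)(const char *what, const char *raw_input, void *ctx);

/* Hypothetical counter that a monitoring system could export. */
static unsigned long fallback_count = 0;

/*
 * Parse a TCP port from `text` (assumed non-NULL). On bad input, return
 * `fallback`, but never silently: log, count, and report the failure.
 */
int parse_port_or_default(const char *text, int fallback,
                          error_cb on_error, void *ctx)
{
    char *end = NULL;
    long  value;

    errno = 0;
    value = strtol(text, &end, 10);

    /* Accept only a complete decimal number in the valid port range. */
    if (errno == 0 && end != text && *end == '\0'
        && value >= 1 && value <= 65535)
        return (int)value;

    /* Mitigate, but leave a trail that includes the data that failed to parse. */
    fprintf(stderr, "parse_port: bad input \"%s\", using %d\n", text, fallback);
    fallback_count++;
    if (on_error != NULL)
        on_error("bad port", text, ctx);
    return fallback;
}
```

The point of the sketch is that the fallback path is the noisy one, so a mitigated bug still shows up in the logs and the counters instead of hiding for months.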