| HN Mirror

Maybe the app comes back up with all its users missing. Maybe you get an email a day later telling you you've been hacked and listing details of your database. Or maybe the one instance wedged itself in a weird way (crashed with a lock open, etc.) and destabilized the system, and you spent sixteen hours getting everything back up.

The severity of a bug has nothing to do with its commonality. Sometimes there's something that happens once, and bankrupts your business. Security vulnerabilities are such.

However, I'm more confused by the statement "add central logging"—how are you doing this, and how much time does it cost you? If you mean enabling logging code you already wrote using ops-time configuration (in effect, increasing the log-level) then I can see your point. If you mean adding logging code, then you're making your ops work block on developers.

Either way, what is the cost you're imagining to central logging, that you would consider adding it in specific cases where it's "worth it", but not otherwise? It's just a bunch of streams going somewhere, getting collated, getting rotated, and then getting discarded. The problem itself is about as well-known as load-balancing; it's infrastructure you just set up once and don't have to think about. It doesn't even have scaling problems!