by lobstrosity420 1126 days ago
Is your argument that the application gracefully recovering from the scenario will somehow make the dev team accommodated?

Hard crashes are not an acceptable substitute for observability, or continuous improvement.


I'm not sure what you mean by "accommodated".

This example doesn't even rise to the level of an "active incident" in the Erlang philosophy. In other words, it's not a bug, so there's no urgency to improve it.

You're just misunderstanding the philosophy.

It's still a bug that should be fixed; it's just that the effects are better contained thanks to the ability to self-heal.

I'm just quoting the article. It's not an "active incident", whatever that means.
It means resolving it can wait until work hours instead of waking someone up in the middle of the night on Saturday.
It means that it's not causing a service outage.
If there is data loss it’s an incident, full stop. Your observability layer should be letting you know.
I agree. However, the linked article that I was quoting from seems to see things differently. It describes a situation in which transactions are failing (i.e. data is being lost), but it's not an incident.
A transaction failing does not mean data loss. If you think it does, you do not understand what graceful recovery means.

Graceful recovery means that something handles the failure after these transactions fail. There is no data loss. The transactions may have been slower, but I think we can agree that slight temporary latency, in exchange for no data loss and graceful handling of unexpected events like your database machine being on fire, is not so bad?

It's still in your logs and you're still tracking it with whatever o11y suite you're using.
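
To make that concrete, here is a rough Python sketch of the idea, not how Erlang/OTP actually implements supervision. Every name in it (`TransientError`, `run_with_recovery`, `make_flaky_write`) is made up for illustration. The failed attempt gets logged, then the work is retried, so the write still lands:

```python
import logging

logger = logging.getLogger("recovery_sketch")


class TransientError(Exception):
    """Hypothetical error standing in for a failed transaction."""


def run_with_recovery(transaction, max_restarts=3):
    """Retry a failing transaction, logging every failure along the way."""
    for attempt in range(1, max_restarts + 1):
        try:
            return transaction()
        except TransientError as exc:
            # The failure is recorded, so monitoring still sees it
            # even though the caller eventually gets a result.
            logger.warning("attempt %d failed: %s", attempt, exc)
    # Past this point recovery has not worked and someone should be told.
    raise RuntimeError(f"gave up after {max_restarts} attempts")


def make_flaky_write(failures):
    """Build a hypothetical write that fails `failures` times, then commits."""
    state = {"remaining": failures}

    def write():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise TransientError("database unavailable")
        return "committed"

    return write


result = run_with_recovery(make_flaky_write(2))
```

Two attempts fail and get logged, the third commits. The caller sees extra latency, not lost data. If the retries run out, the error propagates, and that is the point where it becomes an incident.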