by tetha 1070 days ago
> At AWS, the hierarchy of service priorities is crystal clear: Security, Durability, and Availability. In that order. Durability, the assurance that data will not be lost, is a cornerstone of trust, only surpassed by security. Availability, while important, can vary. Different customers have different needs. But security and durability? They're about trust. Lose that, and it's game over. In this regard, InfluxDB has unfortunately dropped the ball.

Interestingly, this is also how I'd allocate tasks to new admins. Like, sure, I'd rather have my load balancers running, but they are stateless and redeploy in a minute. The amount of damage you can do there in less critical environments is entirely acceptable as a learning experience. Databases or filestores though? Oh boy. I'd rather have someone shadow for a bit first, because those are annoying to fix and a mistake there can always cause unrecoverable loss, even with everything we do against it. Hourly incremental backups still lose up to an hour of data if things go wrong.

> The InfluxDB incident brings to light the ongoing debate around soft vs. hard deletion. It's unacceptable for a hard delete to be the first step in any deprecation process. A clear escalation process is necessary: notify the customer, wait for explicit acknowledgement, disable their APIs for a short period, extend this period if necessary, soft delete for a certain period, notify again, and only then consider a hard delete.
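
For what it's worth, the escalation quoted above maps fairly naturally onto a small state machine. This is only a rough sketch of the idea: the stage names, `next_stage` and `restore` are made up for illustration, and the waiting periods and their extensions are left out entirely.

```python
from enum import Enum


class Stage(Enum):
    # Hypothetical stage names, in the order given in the quote.
    ACTIVE = 0
    NOTIFIED = 1
    ACKNOWLEDGED = 2
    API_DISABLED = 3
    SOFT_DELETED = 4
    FINAL_NOTICE = 5
    HARD_DELETED = 6


def next_stage(stage: Stage, customer_acked: bool = False) -> Stage:
    """Advance exactly one step; a stage can never be skipped."""
    if stage is Stage.HARD_DELETED:
        return stage  # terminal, nothing left to do
    if stage is Stage.NOTIFIED and not customer_acked:
        return stage  # keep waiting for explicit acknowledgement
    return Stage(stage.value + 1)


def restore(stage: Stage) -> Stage:
    """Everything before the hard delete can be rolled back."""
    if stage is Stage.HARD_DELETED:
        raise ValueError("hard delete is irreversible")
    return Stage.ACTIVE
```

The point of the shape is that a hard delete is only reachable by passing through every earlier stage, and that every earlier stage still has a way back.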

Agreed. At work, I'm pushing that we have two processes: First, we need a process of deprecating a service and migrating customers to better services. This happens entirely at a product management and development level. Here you need to consider the value provided for the customer, how to provide it differently - better - and how to decide to fire customers if necessary. And afterwards, you need a good controlled process to migrate customers to the new services, ideally supported by customer support or consultants. No one likes change, so at least make their change an improvement and not entirely annoying.

And then, if a system or an environment is not needed anymore, leadership can trigger a second process to actually remove the service. I maintain, however, that this is a separate process which is entirely operational, between support, operations and account management. It's their job to validate that the system is load-free (I like the electrician's term here), or that we're willing to accept dropping that load. And even then, even if all we see on the systems is a bunch of health checks by customers, you always do a scream test at that point and shut it down for a week, or cut DNS or such. And only then do you drop it.
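
As a rough sketch of that removal gate (not our actual tooling): the field names and the 30-day traffic window are assumptions for illustration, only the one-week scream test comes from the paragraph above.

```python
from dataclasses import dataclass

SCREAM_TEST_DAYS = 7  # "shut it down for a week"


@dataclass
class SystemState:
    # All fields are hypothetical; real inputs would come from monitoring
    # and from support / account management.
    real_requests_last_30d: int      # traffic excluding health checks
    load_drop_accepted: bool         # we agreed to drop the remaining load
    days_shut_down: int              # how long the scream test has run
    complaints_during_shutdown: int  # anyone screaming?


def decommission_action(s: SystemState) -> str:
    """Return the next operational step: keep, restore, scream_test or drop."""
    load_free = s.real_requests_last_30d == 0
    if not (load_free or s.load_drop_accepted):
        return "keep"         # still carrying load nobody agreed to drop
    if s.complaints_during_shutdown > 0:
        return "restore"      # somebody screamed, bring it back
    if s.days_shut_down < SCREAM_TEST_DAYS:
        return "scream_test"  # shut down or cut DNS, data stays intact
    return "drop"             # only now is deletion on the table
```

Note that "drop" is the last branch, so the only way to reach it is a quiet week of scream test.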

It's very, very careful, I'm aware. But it's happened 3-4 times already that a large customer suddenly was like "Oh no we forgot thingy X and now things are on fire and peeps internally are sharpening knives for the meeting, do anything!" And you'd be surprised how much goodwill and trust you can get as a vendor by being able to bring back that thing in a few minutes. Even if you then have to burn that goodwill to turn up the heat and get them off of that service, since it'll be around forever otherwise.