
It reads as very mid-level: enough technical depth to identify problems, but not enough to know where to focus. The major point of writing a post mortem is to identify your own flaws and the risks that led to the issue, so you can fix your own stuff, not to throw a list of action items over the fence. You especially do not write somebody else a bunch of action items without getting their review before publishing.

First off, you are building and running a DBA agent in production, so as a reviewer I want to know why the deployment pipeline for your agent didn't catch this error. What test are you missing? How are you going to improve the test harness in the future?

I'd also want to hear about industry best practices. Based on comments in this thread, "NEVER FUCKING GUESS" is a prompting anti-pattern that creates more desperate outputs to get the calls done, but I'd expect your prompt to have a line for output formatting, like "this operation cannot be completed with the given api key".
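To make the "known refusal string" idea concrete, here is a minimal sketch of doing it at the tool layer instead of only in the prompt. Everything named here (`ToolResult`, `run_tool`, the scope names) is made up for illustration and is not anything from the post mortem or any vendor API. The point is just that the agent gets one fixed, boring answer when the key lacks permission, so there is nothing to guess around.

```python
from dataclasses import dataclass

# Fixed refusal text; hypothetical wording taken from the comment above.
REFUSAL = "this operation cannot be completed with the given api key"


@dataclass(frozen=True)
class ToolResult:
    ok: bool
    message: str


def run_tool(operation: str, key_scopes: set[str]) -> ToolResult:
    """Return a fixed refusal when the key lacks the scope for `operation`.

    `operation` and `key_scopes` are illustrative names, not a real API.
    """
    if operation not in key_scopes:
        # No retry hints, no alternatives: the agent is told to stop.
        return ToolResult(ok=False, message=REFUSAL)
    return ToolResult(ok=True, message=f"{operation}: permitted")
```

A harness test can then assert that the agent's final output contains exactly that string whenever a tool call comes back with `ok=False`.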

There are also devops best practices: you should be deploying your db changes like you deploy code, with code review. You should have a really good reason to skip running db migrations through a deployment pipeline with appropriate tests all the way through, and instead use your DBA agent separately for each stage. It's pretty standard that teams use agents to produce deterministic code, then deploy that; that's a simple process change that would mitigate most of the deleting-prod risk. Did your changes to production follow something like a 2 person review, where two people look at the commands before running them? Why not?
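A rough sketch of what that gate could look like in a pipeline step, assuming the agent (or a person) authors a migration script and two other people have to sign off before it runs. `Migration` and `can_apply` are hypothetical names, and "two approvers other than the author" is my reading of a 2 person review, not a standard definition.

```python
from dataclasses import dataclass, field


@dataclass
class Migration:
    sql: str                      # the deterministic script that will be deployed
    author: str                   # whoever (or whatever agent) generated it
    approvals: set[str] = field(default_factory=set)


def can_apply(migration: Migration, required: int = 2) -> bool:
    """Allow the migration only with enough approvals from non-authors."""
    # The author approving their own script does not count as review.
    reviewers = migration.approvals - {migration.author}
    return len(reviewers) >= required
```

The pipeline would call `can_apply` before the production stage and fail the deploy if it returns `False`, so nothing reaches prod on the agent's say-so alone.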

The agent response accurately points out a risk which goes unaddressed: why do you have staging and prod commingled? Have you fixed that problem yet by making a second account or volume or whatever that gives you stage isolation? If you are purposefully having staging run against the prod tables, staging is prod.
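One cheap check that catches the commingling case, as a sketch: compare the staging and prod connection URLs at deploy time and refuse to continue if they resolve to the same host, port, and database. The function name and the URLs in the usage note are invented for illustration; it only uses `urllib.parse.urlsplit` from the standard library, and it will not catch two hostnames that alias the same server.

```python
from urllib.parse import urlsplit


def shares_backend(staging_url: str, prod_url: str) -> bool:
    """True if both URLs point at the same host, port, and database path.

    Credentials are ignored on purpose: a different user on the same
    database is still the same database.
    """
    s, p = urlsplit(staging_url), urlsplit(prod_url)
    return (s.hostname, s.port, s.path) == (p.hostname, p.port, p.path)
```

For example, `shares_backend("postgres://stg:x@db.internal:5432/app", "postgres://prod:y@db.internal:5432/app")` is `True`, which is exactly the "staging is prod" situation.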

A senior post mortem should be clearly actionable by your own team so this does not happen again. You own your system, not Cursor or Railway. Maybe you considered these things in a different document, but the only other thing you point at is that you first wanted to blame Anthropic.