Hacker News
by qxmat 1578 days ago
I've found that external tech requirements are horrible to work with, especially when the underlying stack simply doesn't support them. Normally these are pushed by certified cloud consultants or by an intrepid architect who found another "best practice" blog.

It begins with small requirements, such as coming up with a disaster recovery plan, only for it to be rejected because your stack must "automatically heal" and devs can't be trusted to restore a backup during an emergency.

Blink and you're implementing redundant networking (cross-AZ route tables, DNS failover, SDN via gateways/load balancers), a ZooKeeper ensemble with >= 3 nodes in 3 AZs, per-service health checks, EFS/FSx network mounts for the persistent data that an expensive enterprise app insists on storing on disk, and some kind of HA database/multi-master SQL cluster.

... months and months of work because a 2 hour manual restore window is unacceptable. And when the dev work is finally complete after 20 zero-downtime releases over 6 months (bye weekend!) how does it perform? Abysmally - DNS caching left half the stack unreachable (partial data loss) and the mission critical Jira Server fail-over node has the wrong next-sequence id because Jira uses an actual fucking sequence table (fuck you Atlassian - fuck you!).

If only the requirement was for a DR run-book + regular fire drills.
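For what it's worth, the sequence-table failure is easy to reproduce in miniature. A minimal sketch, assuming a hypothetical `sequence_value` counter table and a hypothetical `issue` table (these names are made up for illustration and are not Jira's actual schema): if the failover copy has the newer rows but a stale counter, the next allocated ID collides unless the run-book includes a reconcile step.

```python
import sqlite3


def make_db():
    """Create an in-memory DB with a counter table instead of a native sequence."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE sequence_value (seq_name TEXT PRIMARY KEY, next_id INTEGER NOT NULL)"
    )
    db.execute("CREATE TABLE issue (id INTEGER PRIMARY KEY, summary TEXT)")
    db.execute("INSERT INTO sequence_value VALUES ('issue', 1)")
    return db


def allocate_id(db, seq_name):
    """Read the counter row, bump it, and return the value that was read."""
    (next_id,) = db.execute(
        "SELECT next_id FROM sequence_value WHERE seq_name = ?", (seq_name,)
    ).fetchone()
    db.execute(
        "UPDATE sequence_value SET next_id = ? WHERE seq_name = ?",
        (next_id + 1, seq_name),
    )
    return next_id


def create_issue(db, summary):
    issue_id = allocate_id(db, "issue")
    db.execute("INSERT INTO issue (id, summary) VALUES (?, ?)", (issue_id, summary))
    return issue_id


def reconcile_sequence(db, seq_name="issue"):
    """Run-book step: move the counter past the highest ID actually in use."""
    (max_id,) = db.execute("SELECT COALESCE(MAX(id), 0) FROM issue").fetchone()
    db.execute(
        "UPDATE sequence_value SET next_id = MAX(next_id, ?) WHERE seq_name = ?",
        (max_id + 1, seq_name),
    )
    (next_id,) = db.execute(
        "SELECT next_id FROM sequence_value WHERE seq_name = ?", (seq_name,)
    ).fetchone()
    return next_id


def simulate_stale_failover():
    db = make_db()
    for i in range(3):
        create_issue(db, f"issue {i}")
    # Assumption for the demo: the data rows made it to the failover node,
    # but the counter row is behind.
    db.execute("UPDATE sequence_value SET next_id = 2 WHERE seq_name = 'issue'")
    try:
        create_issue(db, "created after failover")
        collided = False
    except sqlite3.IntegrityError:
        collided = True
    fixed_next = reconcile_sequence(db)
    new_id = create_issue(db, "created after reconcile")
    return collided, fixed_next, new_id
```

Calling `simulate_stale_failover()` shows the first insert after the simulated failover hitting a primary-key collision, and the insert after `reconcile_sequence` going through. A step like that belongs in the DR run-book whichever way the restore happens.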


I think this highlights the importance of actually analyzing your RPO/RTO (recovery point objective/recovery time objective) requirements through the lens of business value, and being honest about the ROI of buying that extra 9 in uptime.

It may be the case that 2 hours of downtime is completely unacceptable for the business, and paying $Xmm extra per year to maintain it is the right call. Or it may be that the business would be horrified to learn how many dollars are being spent to avert a level of downtime that no customer would notice or care about.
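To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. The downtime-per-nine arithmetic is standard; the function names are my own, and any dollar figures you feed in are your own estimates, not data from this thread.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_pct):
    """Hours of allowed downtime per year at a given availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def break_even_cost_per_hour(annual_ha_cost, hours_avoided_per_year):
    """Hourly downtime cost above which the HA spend pays for itself."""
    return annual_ha_cost / hours_avoided_per_year


def ha_pays_off(annual_ha_cost, downtime_cost_per_hour, hours_avoided_per_year):
    """True if the downtime avoided is worth more than the yearly HA spend."""
    return downtime_cost_per_hour * hours_avoided_per_year > annual_ha_cost


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99):
        print(f"{pct}% uptime allows {downtime_hours_per_year(pct):.2f} h/year of downtime")
```

As a purely hypothetical example: if the HA build costs 500,000 a year and it only saves one 2-hour manual restore per year, `break_even_cost_per_hour(500_000, 2)` says an hour of downtime has to cost the business more than 250,000 before the spend is justified.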

If the requirement is just being set by engineering, then it's more about finding the equilibrium where the resource spent on automation balances the cost of the manual toil and the associated morale impact on the team. Nobody wants to work on a team where everything is on fire all the time, and it's time/money well spent to avert that situation.

...how is the Jira server mission critical? Is it tied to CI/CD somehow?
In the enterprise you'll find that Jira is used for general workflow management, not just CI/CD. I've encountered teams of analysts who spend their working day moving and editing work items. It's the Quicken of workflow management solutions.

Jira Server is deliberately hobbled by the sequence table and the lack of Aurora support, and it's now EOL (no security updates 1 year after purchase!). The DC edition scales horizontally if you have 100k.

Jira in general is a poorly thought-out product (looking at you, customfield_3726!) but it's held in such high regard by users that it's impossible to avoid.

Pre-COVID I would have laughed at this. But now no one knows what a user story should be unless they can read it off Jira, and there are no backups, of course.
Gives me a fun idea: a program that randomly deletes items out of your backlog.
"Chaos engineering for your backlog"
I've done that. I deleted items from the backlog that I thought made no sense (anymore), and nobody cared or asked any questions. If you haven't worked on it for the last 18 months, it's probably not important and nobody cares.
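Both ideas fit in a few lines. A minimal sketch, assuming the backlog has been exported to a list of items with a last-updated timestamp; the `BacklogItem` type, the function names, and the 18-month cutoff (approximated here as 548 days) are illustrative, and nothing in it talks to a real Jira API.

```python
import random
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class BacklogItem:
    key: str
    last_updated: datetime


def stale_items(items, now, max_age_days=548):
    """Return items nobody has touched in roughly the last 18 months."""
    cutoff = now - timedelta(days=max_age_days)
    return [item for item in items if item.last_updated < cutoff]


def chaos_pick(items, count, seed=None):
    """Pick up to `count` random items as deletion candidates."""
    rng = random.Random(seed)  # seed it for a repeatable drill
    pool = list(items)
    return rng.sample(pool, min(count, len(pool)))


if __name__ == "__main__":
    now = datetime(2021, 6, 1)
    backlog = [
        BacklogItem("PROJ-1", datetime(2019, 1, 15)),
        BacklogItem("PROJ-2", datetime(2021, 5, 20)),
        BacklogItem("PROJ-3", datetime(2018, 11, 2)),
    ]
    print("stale:", [item.key for item in stale_items(backlog, now)])
    print("chaos:", [item.key for item in chaos_pick(backlog, 1)])
```

Both functions only return candidates and delete nothing, so the list can be reviewed (or quietly ignored) first.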