| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by nonameiguess 1112 days ago

Early on when I'd first started making the transition from pure developer role working only on product to a platform role running the development environment, I was encountering problems with build scripts on CI servers leaving behind a bunch of dead symlinks. Tired of tracking them down manually, I wrote a nice script that automatically found all dead symlinks and deleted them.

It turned out, for some arcane reason I still don't understand, our production instance of Artifactory was running on top of Docker Compose with host path volume mounts, and somehow, symlinks that were not valid from the perspective of the host actually were valid from inside the container, and doing this on all of our servers broke Artifactory. For some even stupider reason, we weren't doing full filesystem-level snapshots at any regular interval (which we started doing after this), so instead I needed to enlist the help of the classic wizard ninja guy who had been acting as a mostly unsupervised one-man team for the past six years who had hacked all of this mess together, documented none of it, and was the only person on the planet who knew how to reconstruct everything.

This was probably still only the second-stupidest full on-prem lab outage I remember, behind the time the whole network stopped working and the only person who had been around long enough remembered they had trialed a temporary demo hardware firewall years earlier, management abandoned the evaluation effort, and it somehow remained there as the production firewall for years without ever being updated before finally breaking.