|
Had to launch our AI product next day else customer would jump off and this would've basically killed the company.
But AI was far from perfect and did only work for very small amount of usecases.
Exchanged the AI with a queue and UI so humans could do the task instead in realtime.
Worked. Went live. Completly smashed competitors. -- Just at launchday our internal infrastructure was fucked pretty bad and we faced a severe outage on everything non-productive.
Product failed at some place to do proper error handling and basically got stuck forever since it couldn't reach some monitoring endpoint we had still placed on our int infra.
We couldn't build a fix, since basically CI/Signing was part of the outage.
ETA to get infra back up was something like ~8h. Panic grew as it would've totally fucked our go live.
Hacked together a dummy http server returning status code 200 and exchanged DNS entry to point to that instead of int infra.
Worked. Went live. -- Had to do an audit before we could go live, due to deadlines and planned vacation of auditor it needed to be spot on, no time for touch ups and re-audit.
During the audit, he asked for a thing we forgot to do (some security alerting on a 3rd party tool which didnt support it), told him something along "sure, love to show that, yadda yadda, lets have lunch first." went to the most chatty guy in the office to join us for lunch (so he wastes some time), faked a "oh, shit, forgot i have a sync call, will catch you up there".
While they were having lunch, I checked out the 3rd party code, added the alerting, wasted 20 mins with their broken buildsys. Managed to build the one lib necessary for the alerts, but not the whole image. Overrode the lib in the original img instead and hoped shit would work. They were back too early, couldnt test, just fired up the deployment, wasted some more time babbling with the auditor till it went through. We live tested it together.
Worked. Audit went through without any remarks and we hit launchday. |