by gregtaleck 3394 days ago
"While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future."

I wonder why this wasn't thought of when the system was created. Of course, I don't have experience at that scale, so I'm just wondering.
It seems like sometimes this is just how iteratively automating things works, especially on an internal-facing tool.

You have some process that starts out being "deploy this app with this Java code". You deploy once in a while, so it's not a big deal. But then those changes get a bit more frequent, so you pull out the common bits and the process becomes "make this YAML change in git and redeploy the app".

That works until you find yourself deploying 5 times a day, so you turn it into a MySQL table, and the process becomes "write a ROLL plan that executes this UPDATE x=y WHERE t=u; command".

After a while you get super annoyed at some quirk of the commands and figure, "Ok, fine, I'll just add an endpoint and some logic that just does this for the common case."

Then you wanna go on vacation and the new guy messed up the API request last week, so you figure, "I'll just add a little JS interface with a little red warning if the request is messed up in this way or that before I go".

You get back from vacation and some original interested party (whoever has wanted all these changes deployed) watched the intern make the change and thinks they could just do it themselves if they had access to the interface. You're wary, but you make the changes together a few times and maybe even add a little "wait-for-approval" node in the state machine.

Life is good. You've basically de-looped yourself, aside from a quick sanity check and button press, instead of what was a ~2 hour code + build + PR + PR approved + deploy process.

Then that interested party goes to work for Uber and the rest of your team adds a few features on top of the interface you built and it all goes pretty well, until you realize that this thing that used to be 20 YAML objects is now 50k database records, and a bunch of them don't even apply anymore. So you build a button to disable some group of them, but after getting it deployed you realize it's actually possible to issue a "disable all" request accidentally if you click a button in your janky JS front-end before the 50k records download and get parsed and displayed. Oops! This mistake that you and the original interested party would never have made (because you spent the last 2 years thinking about all this crap) is probably a single impatient, anxious mouse-click away from happening. So you make a patch and deploy that.
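
To make that concrete, a patch like that could be a server-side check along these lines. This is only a rough sketch: `check_bulk_disable`, `BulkChangeRefused`, and the 5% threshold are all made up for illustration, and the point is just that the backend refuses requests that look accidental instead of trusting the front-end.

```python
class BulkChangeRefused(Exception):
    """Raised when a bulk request looks too broad to be intentional."""


def check_bulk_disable(requested_ids, total_records, max_fraction=0.05, confirmed=False):
    """Return the sorted, de-duplicated ids to disable, or raise.

    Hypothetical helper: the names and the 5% default are assumptions,
    not anything from a real system.
    """
    ids = set(requested_ids)
    if not ids:
        # An empty selection is what a half-loaded UI tends to send.
        # Never interpret "nothing selected" as "everything".
        raise BulkChangeRefused("empty selection; refusing to guess the target set")
    if total_records <= 0:
        raise BulkChangeRefused("record count unknown; refusing")
    # Allow at least one record, otherwise cap at a fraction of the table.
    limit = max(1, int(total_records * max_fraction))
    if len(ids) > limit and not confirmed:
        raise BulkChangeRefused(
            f"{len(ids)} records exceeds the limit of {limit} without confirmation"
        )
    return sorted(ids)
```

With 50k records and the default threshold, anything over 2,500 ids needs an explicit `confirmed=True`, which the UI would only send after a second, deliberate step.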

Congrats! You found that particular failure mode and added some protections for it, and maybe added some other protections like rate-limiting the deletions or updates or whatever. That's cool, but is that every failure mode? I bet it isn't. What happens when someone else thinks you have too many endpoints and just drops to SQL for the update?
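
For what it's worth, the two safeguards in the quoted postmortem (remove capacity more slowly, and never go below a minimum) could be sketched roughly like this. `plan_removal` and its parameters are hypothetical names of my own, not how any real tool does it, and it only covers this one path: it does nothing for the person who drops to SQL.

```python
def plan_removal(current, requested, minimum, max_per_step):
    """Split a capacity removal into small steps that respect a floor.

    Hypothetical sketch: returns a list of step sizes, or raises
    ValueError if the request would take capacity below `minimum`.
    """
    if requested < 0 or max_per_step <= 0:
        raise ValueError("requested must be >= 0 and max_per_step must be > 0")
    if current - requested < minimum:
        # The floor check happens before anything is removed at all.
        raise ValueError(
            f"removing {requested} from {current} would go below the minimum of {minimum}"
        )
    steps = []
    remaining = requested
    while remaining > 0:
        # Never remove more than max_per_step at once.
        step = min(max_per_step, remaining)
        steps.append(step)
        remaining -= step
    return steps


print(plan_removal(current=100, requested=25, minimum=60, max_per_step=10))
```

The caller would then apply one step at a time, presumably pausing and checking health in between, so a fat-fingered input gets rejected up front or at worst plays out slowly enough for someone to notice.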

Basically, yeah, of course you think of this stuff while iterating on it. But you figure "only power users are on the ACL" or "my teammates will understand the data model before making changes, or ask me first" or "that's what ROLL plans are for" or "I'll show a warning in the UI" or whatever. Fundamentally, you're thinking about a way to do a thing, if you're even thinking about it at all.

So yeah, that's what I've spent the last year or two doing. :-)

I've been doing this too, though not at anything like this scale. It's mostly Python scripts that automate something, and since the scale is low and I'm the sole owner and user, I'm good to go :-D