by fweespeech 3807 days ago
1) We don't keep a formal on-call rotation, but 3 of us are tied to an automated alert system and whoever has a chance to take care of it does. We are all full stack devs, so generally we can fix it at the time. If it's complicated, we can help it hobble along and fix it later.

2) We get 3-4 alerts a year that have to be handled before the next business day.

3) As such, there is no real work priority, triage, etc. You resolve it immediately. There is no other priority. [ Any on-call event == lost money ]

> * how do you manage for other teams' risk? (ie their api goes down, you can't satisfy your customers)

Asynchronous processing. I buffer until their API is back up. I do this for literally dozens of companies from small manufacturers to Amazon.

There really isn't any other good way to handle it. If you need to do it otherwise, that is a fundamental architectural problem that should have been resolved at the design phase.
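A minimal sketch of the buffer-until-it's-back idea, not the actual system described here. The `StoreAndForward` class and the `send` callable are hypothetical names for illustration, and the queue lives in memory, whereas a real buffer like the one described would be persisted to disk.

```python
import collections
from typing import Callable, Deque


class StoreAndForward:
    """Buffer outbound messages and deliver them once the remote API is reachable."""

    def __init__(self, send: Callable[[str], None]) -> None:
        # `send` is a hypothetical callable that raises ConnectionError
        # while the remote API is down.
        self._send = send
        self._queue: Deque[str] = collections.deque()

    def submit(self, message: str) -> None:
        # Always enqueue; the caller never waits on the remote API.
        self._queue.append(message)

    def drain(self) -> int:
        """Deliver queued messages in order; stop at the first failure."""
        delivered = 0
        while self._queue:
            try:
                self._send(self._queue[0])
            except ConnectionError:
                break  # remote is still down, keep everything buffered
            self._queue.popleft()  # remove only after a successful send
            delivered += 1
        return delivered

    def __len__(self) -> int:
        return len(self._queue)
```

A background worker would call `drain()` on a timer. Because a message is removed only after `send` succeeds, an outage on the other side just makes the queue grow instead of losing data.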

1 comments

Has your buffering got any back-pressure at all? Or do you buffer until the disk is full?

An API would need to be down for ~96 hours at the load of the heaviest 4-day period in our history for the buffer to fill the disk. All of them would also have to go down simultaneously.

There is backpressure at 85% disk fill (this is also an on-call trigger event, since it shouldn't ever happen in practice). Suffice it to say, this never happens in the real world without hardware failures.
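A rough sketch of what an 85% disk-fill check could look like, assuming the buffer sits on a single mounted path. The function names and the default path are illustrative, not taken from the real setup.

```python
import shutil

BACKPRESSURE_THRESHOLD = 0.85  # the 85% fill level mentioned above


def should_apply_backpressure(
    used: int, total: int, threshold: float = BACKPRESSURE_THRESHOLD
) -> bool:
    """Return True once the used fraction of the disk reaches the threshold."""
    return used / total >= threshold


def check_buffer_disk(path: str = ".") -> bool:
    """Check the disk holding `path` (a hypothetical buffer directory)."""
    usage = shutil.disk_usage(path)  # named tuple with total, used, free (bytes)
    return should_apply_backpressure(usage.used, usage.total)
```

When the check returns True, producers would be slowed or rejected and an alert raised, matching the on-call trigger described above.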

Disks are cheaper than fatiguing engineers with on-call events. This is accomplished by basically having a nearly empty 3 VM cluster with 512 GB of SSD each (RAID 1 pair, so 2 disks). The load on the rest of the VM host is negligible given it's already being asynchronously processed from this cluster, so it's really just filling the extra disks on the VM host we dedicated to this purpose.

Just realize we build to four 9s.

YTD failures: 3662/75631129 = 0.00004841921 (about 0.0048% of calls)

This doesn't trigger an on-call event because it recovered automatically, but the cluster does fail every so often for ~3k events. This is for a single API call.
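The YTD figure above can be checked against the four 9s target with a few lines of arithmetic. The variable names are only for illustration, and four 9s is taken here to mean a success rate of at least 99.99%.

```python
failures = 3662
total_calls = 75_631_129

failure_rate = failures / total_calls  # fraction of calls that failed
success_rate = 1 - failure_rate

# Four 9s is assumed to mean at least 99.99% of calls succeed.
meets_four_nines = success_rate >= 0.9999

print(f"failure rate: {failure_rate:.6%}")
print(f"success rate: {success_rate:.4%}")
print(f"meets four 9s: {meets_four_nines}")
```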