by acveilleux 3810 days ago
Has your buffering got any back-pressure at all? Or do you buffer until the disk is full?

It would need to be down for ~96 hours during the heaviest 4-day period in our history to fill the disk. All of them would also have to go down simultaneously.

There is backpressure at 85% disk fill (this is also an on-call trigger event, since it shouldn't ever happen in practice). Suffice it to say, this never happens in the real world without hardware failures.
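For anyone wanting to picture the 85% rule, here is a minimal sketch, not our actual implementation. The function names, the in-memory list standing in for the on-disk buffer, and the idea of returning False to the producer are all assumptions made up for illustration; only the 0.85 threshold comes from the comment above.

```python
import shutil

# The only number taken from the thread: backpressure starts at 85% disk fill.
BACKPRESSURE_THRESHOLD = 0.85


def disk_fill_fraction(path="/"):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def should_apply_backpressure(fill_fraction, threshold=BACKPRESSURE_THRESHOLD):
    """True once the disk is at or above the fill threshold."""
    return fill_fraction >= threshold


def try_buffer(event, buffer, fill_fraction):
    """Buffer the event and return True, or refuse it and return False.

    `buffer` is a plain list here (hypothetical stand-in for the disk queue).
    A False result is the backpressure signal: the producer should slow down
    or retry later, and in a real system this is where on-call would be paged.
    """
    if should_apply_backpressure(fill_fraction):
        return False
    buffer.append(event)
    return True
```

A caller would do something like `try_buffer(event, queue, disk_fill_fraction("/var/spool"))`, where the path is whatever volume holds the buffer (also hypothetical here).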

Disks are cheaper than fatiguing engineers with on-call events. We accomplish this with a nearly empty 3-VM cluster, each VM with a 512 GB SSD (RAID 1 pair, so 2 disks). The load on the rest of the VM host is negligible, given that it's already being processed asynchronously from this cluster, so it's really just filling the extra disks on the VM host that we dedicated to this purpose.

Keep in mind that we build to four nines (99.99%).

YTD failures: 3662 / 75631129 = 0.00004841921 (about 0.0048% of calls)

This doesn't trigger an on-call event because the cluster recovers automatically, but it does fail every so often, for ~3k events. This is for a single API call.
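To check the YTD figure against the four-nines target, here is the arithmetic as a short sketch. The two counts are the ones quoted above; treating "four nines" as a failure budget of 1 in 10,000 calls is my reading of the target, not something measured.

```python
# Counts quoted in the comment above.
failures = 3662
total_calls = 75_631_129

# Four nines (99.99% success) allows at most 1 failure per 10,000 calls.
FOUR_NINES_FAILURE_BUDGET = 1e-4

failure_rate = failures / total_calls
success_rate = 1 - failure_rate

# How many failures the budget would have allowed over the same call volume.
allowed_failures = total_calls * FOUR_NINES_FAILURE_BUDGET

meets_four_nines = failure_rate <= FOUR_NINES_FAILURE_BUDGET

print(f"failure rate:     {failure_rate:.11f}")
print(f"success rate:     {success_rate:.6%}")
print(f"allowed failures: {allowed_failures:.0f}")
print(f"meets four nines: {meets_four_nines}")
```

So the observed failure rate is a bit under half of the four-nines budget for the same number of calls.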