It would need to be down for ~96 hours, at the load of the heaviest 4-day period in our history, to fill the disk. All of them would also have to go down simultaneously.
There is backpressure at 85% disk fill (this is also an on-call trigger, since it shouldn't ever happen in practice). Suffice it to say, this never happens in the real world without hardware failures.
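For anyone wondering what an 85% check looks like, here is a minimal sketch. It is not our actual implementation; the function names and the default path are made up for illustration, and only the 85% threshold comes from the setup described above.

```python
import shutil

# Threshold from the description above: backpressure starts at 85% disk fill.
BACKPRESSURE_THRESHOLD = 0.85


def fill_fraction(used_bytes: int, total_bytes: int) -> float:
    """Return the fraction of the disk that is in use (0.0 to 1.0)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes


def should_apply_backpressure(
    used_bytes: int,
    total_bytes: int,
    threshold: float = BACKPRESSURE_THRESHOLD,
) -> bool:
    """True when producers should be slowed down (and on-call paged)."""
    return fill_fraction(used_bytes, total_bytes) >= threshold


def check_path(path: str = "/") -> bool:
    """Hypothetical helper: apply the check to the disk backing `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return should_apply_backpressure(usage.used, usage.total)
```

In a real queue the same predicate would gate acceptance of new writes; here it only returns a boolean.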
Disks are cheaper than fatiguing engineers with on-call events. We accomplish this by basically having a nearly empty 3-VM cluster, each VM with a 512 GB SSD (RAID 1 pair, so 2 disks). The load on the rest of the VM host is negligible given it's already being processed asynchronously from this cluster, so it's really just filling the extra disks on the VM host that we dedicated to this purpose.
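As a rough back-of-the-envelope, the figures above imply a peak fill rate. This sketch assumes the full 512 GB (decimal) is usable per VM, since a RAID 1 pair mirrors rather than adds capacity, and that one disk fills in exactly 96 hours; both are simplifications, not measured values.

```python
# Assumptions (not measurements): 512 GB usable per VM, 96 hours to fill.
DISK_BYTES = 512 * 10**9
HOURS_TO_FILL = 96


def implied_fill_rate_bytes_per_sec(disk_bytes: int, hours: float) -> float:
    """Sustained write rate needed to fill `disk_bytes` in `hours`."""
    return disk_bytes / (hours * 3600)


rate = implied_fill_rate_bytes_per_sec(DISK_BYTES, HOURS_TO_FILL)
print(f"{rate / 10**6:.2f} MB/s")  # roughly 1.48 MB/s at peak load
```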
Just realize we build to four nines (99.99%).
YTD failures: 3662/75631129 = 0.00004841921 (about 0.0048%, so roughly 99.995% success)
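To sanity-check that figure against a four-nines target, here is the arithmetic spelled out. The two counts are the YTD numbers above; the variable names are just for illustration.

```python
failures = 3662
total_calls = 75_631_129

failure_rate = failures / total_calls
success_rate = 1 - failure_rate

# Four nines (99.99%) allows at most 1 failure per 10,000 calls.
ALLOWED_FAILURE_RATE = 1e-4
meets_four_nines = failure_rate <= ALLOWED_FAILURE_RATE

# Failure budget at this call volume, rounded down to whole calls.
failure_budget = int(total_calls * ALLOWED_FAILURE_RATE)

print(f"failure rate: {failure_rate:.11f}")
print(f"success rate: {success_rate:.4%}")
print(f"meets four nines: {meets_four_nines}")
print(f"budget used: {failures} of {failure_budget} allowed failures")
```

So the YTD failures use a bit under half of the four-nines budget.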
This doesn't trigger an on-call event because it recovers automatically, but the cluster does fail every so often for ~3k events. This is for a single API call.