by ram_rar
1828 days ago
> On June 22nd at 8:25 AM UTC, we released a new update designed to reduce the download size of the optimization database. Unfortunately, this managed to upload a corrupted file to the Edge Storage.
I wonder if a simple checksum verification of the file would have helped avoid this outage altogether.
> Turns out, the corrupted file caused the BinaryPack serialization library to immediately execute itself with a stack overflow exception, bypassing any exception handling and just exiting the process. Within minutes, our global DNS server fleet of close to a 100 servers was practically dead
This is exactly why one needs canary-based deployments. I have seen umpteen issues caught in canary, which has saved my team tons of firefighting time.
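To make the checksum idea concrete, here is a rough Python sketch of what I have in mind. It is not how bunny.net's pipeline works; the names `publish`, `load_verified` and `CorruptPayloadError` are made up for illustration, and I am assuming the publisher can ship a SHA-256 digest alongside the file.

```python
import hashlib


class CorruptPayloadError(Exception):
    """Raised when a payload does not match its published digest (hypothetical name)."""


def publish(payload: bytes) -> tuple[bytes, str]:
    # Publisher side: compute the digest once and ship it next to the file.
    return payload, hashlib.sha256(payload).hexdigest()


def load_verified(payload: bytes, expected_sha256: str) -> bytes:
    # Consumer side: refuse to hand the bytes to the deserializer
    # unless the digest matches what the publisher recorded.
    actual = hashlib.sha256(payload).hexdigest()
    if actual != expected_sha256.strip().lower():
        raise CorruptPayloadError(
            f"digest mismatch: expected {expected_sha256}, got {actual}"
        )
    return payload


if __name__ == "__main__":
    data, digest = publish(b"optimization database v2")
    load_verified(data, digest)  # intact payload passes
    try:
        load_verified(data[:-1], digest)  # truncated payload is rejected
    except CorruptPayloadError as exc:
        print("rejected:", exc)
```

A check like this only catches corruption that happens after the digest is computed (upload, storage, download). If the file is generated wrong in the first place, the digest will happily match, which is where canaries come in.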
|
Oh man, you stirred up a really old Cloudflare memory. Back when I was working on our DNS infrastructure I wrote up a task that says: "RRDNS has no way of knowing how many lines to expect or whether what it is read is valid. This could create an issue where the LB map data is not available inside RRDNS."
At the time, this "LB map" thing was critical to the mapping between a domain name and its associated IP address(es). Without it, Cloudflare wouldn't work. Re-reading the years-old Jira ticket, I see myself and Lee Holloway discussing checksumming the data. He implemented writing the checksum, and I implemented the read and check.
I miss Lee.