Hacker News
by ram_rar 1828 days ago
> On June 22nd at 8:25 AM UTC, we released a new update designed to reduce the download size of the optimization database. Unfortunately, this managed to upload a corrupted file to the Edge Storage.

I wonder if a simple checksum verification of the file would have helped avoid this outage altogether.
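To make the idea concrete, here is a minimal sketch of checksum verification, assuming the writer prefixes the file with a SHA-256 hex digest on its own line. The layout and the names `write_with_checksum` and `read_and_verify` are hypothetical, not anything from the post.

```python
import hashlib


def write_with_checksum(payload: bytes) -> bytes:
    """Prefix the payload with a header line holding its SHA-256 hex digest."""
    digest = hashlib.sha256(payload).hexdigest().encode("ascii")
    return digest + b"\n" + payload


def read_and_verify(blob: bytes) -> bytes:
    """Return the payload, or raise ValueError if the digest does not match."""
    header, sep, payload = blob.partition(b"\n")
    if not sep:
        raise ValueError("missing checksum header")
    expected = hashlib.sha256(payload).hexdigest().encode("ascii")
    if header != expected:
        # Refuse to hand corrupted bytes to the deserializer.
        raise ValueError("checksum mismatch, refusing to load file")
    return payload
```

A check like this only catches corruption that happens after the digest was computed (upload, storage, download). It cannot tell that the generator produced a bad file in the first place.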

> Turns out, the corrupted file caused the BinaryPack serialization library to immediately execute itself with a stack overflow exception, bypassing any exception handling and just exiting the process. Within minutes, our global DNS server fleet of close to a 100 servers was practically dead

This is exactly why one needs canary-based deployments. I have seen umpteen issues caught in canary, which has saved my team tons of firefighting time.
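Roughly what I mean by a canary, as a sketch only: push the update to a small slice of the fleet, check health, and only then continue. The `deploy`, `is_healthy` and `rollback` callables are hypothetical hooks you would wire to your own tooling.

```python
from typing import Callable, Sequence


def canary_rollout(
    servers: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    canary_count: int = 1,
) -> bool:
    """Deploy to a few canaries first; continue only if they stay healthy.

    Returns True if the full rollout completed, False if it was aborted.
    """
    canaries = list(servers[:canary_count])
    rest = list(servers[canary_count:])

    for server in canaries:
        deploy(server)

    # An unhealthy canary stops the rollout before it reaches the whole fleet.
    if not all(is_healthy(server) for server in canaries):
        for server in canaries:
            rollback(server)
        return False

    for server in rest:
        deploy(server)
    return True
```

In practice the health check needs to run long enough, and exercise enough of the real traffic path, for the failure to actually show up on the canary.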

3 comments

> I wonder, if simple checksum verification of the file would have helped in avoiding this outage all together.

Oh man, you stirred up a really old Cloudflare memory. Back when I was working on our DNS infrastructure I wrote up a task that says: "RRDNS has no way of knowing how many lines to expect or whether what it is read is valid. This could create an issue where the LB map data is not available inside RRDNS."

At the time, this "LB map" thing was critical to the mapping between a domain name and its associated IP address(es). Without it, Cloudflare wouldn't work. Re-reading the years-old Jira, I see myself and Lee Holloway discussing the checksumming of the data. He implemented the writing of the checksum and I implemented the read and check.

I miss Lee.

For those who, like me, don't know the story, here it is: https://www.wired.com/story/lee-holloway-devastating-decline...

I'm deeply moved after reading it. Can't imagine how tragic it must be for people who know Lee.

Sounds similar to what happened to Nietzsche:

https://en.wikipedia.org/wiki/Friedrich_Nietzsche#Mental_ill...

That was an incredible story, and I went down a rabbit hole of reading more about that disease. Thank you very much for sharing.

Wow, that is absolutely tragic. Neurodegenerative diseases are something I fear the most, having seen what Huntington's can do to somebody.

In the post or comments, they claimed to be using a canary; perhaps their canary simply didn't die in the coal mine?

That doesn't protect against a file already being generated in a broken fashion, or against its content being incompatible with the newest schema you are using for deserialization.

For serialization in a distributed system, you always want a parser that can detect invalid data and has the means to support forward and backward compatibility.
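As a rough sketch of what that can look like, assuming a JSON payload with an explicit `version` field: the `Record` type, its fields and the version numbers are all made up for illustration. Missing newer fields fall back to defaults (old data still loads), unknown extra fields are ignored (data from a newer writer still loads), and anything malformed raises one well-defined exception instead of taking the process down.

```python
import json
from dataclasses import dataclass

SUPPORTED_VERSIONS = {1, 2}  # hypothetical schema versions


class InvalidRecord(ValueError):
    """Raised when the input cannot be trusted, so callers can handle it."""


@dataclass
class Record:
    name: str
    ttl: int = 300  # pretend this field was added in version 2


def parse_record(raw: bytes) -> Record:
    try:
        data = json.loads(raw)
    except (UnicodeDecodeError, json.JSONDecodeError, RecursionError) as exc:
        raise InvalidRecord(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise InvalidRecord("top-level value must be an object")

    version = data.get("version")
    if not isinstance(version, int) or version not in SUPPORTED_VERSIONS:
        raise InvalidRecord(f"unsupported version: {version!r}")

    name = data.get("name")
    ttl = data.get("ttl", 300)  # default keeps version 1 data readable
    if not isinstance(name, str) or not isinstance(ttl, int):
        raise InvalidRecord("missing or wrongly typed field")

    # Unknown keys are ignored on purpose, so newer writers can add fields.
    return Record(name=name, ttl=ttl)
```

The caller can then catch `InvalidRecord` and keep serving the last known-good data rather than exiting.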

> forward and backward compatibility

Also for HTTP requests I suppose