by unilynx 1824 days ago
I read it as "DNS software changed, that worked fine, but it turns out we sometimes generate a broken database - not often enough to see it hit during canary, but devastating when it finally happened".

GP also notes that this database changed perhaps every 30 seconds.

Just a few guesses... If you have a process that corrupts a random byte once every 100,000 runs, and you run it every 30 seconds, it might take weeks (about 24 days) before you're at 50% odds of having seen it happen. And if that used to be a text or JSON database, a corrupted byte might not even damage anything important. Or if the code swallows the exception at some level, it might even self-heal after 30 seconds when new data comes in, causing an unnoticed blip in the monitoring, if anything at all.
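To put rough numbers on that guess, here is a back-of-the-envelope sketch. The 1-in-100,000 failure rate and the 30-second interval are the guesses from above, each run is assumed to fail independently, and the one-day canary window is purely a hypothetical figure for illustration, not something known about the actual rollout.

```python
import math

# Assumptions (guesses from the comment, not measured values):
p_fail = 1 / 100_000   # chance that a single run writes a broken database
interval_s = 30        # one run every 30 seconds


def p_seen(runs: int) -> float:
    """Probability of at least one bad run among `runs` independent runs."""
    return 1 - (1 - p_fail) ** runs


# Smallest number of runs at which the odds of having seen a failure reach 50%.
runs_for_half = math.ceil(math.log(0.5) / math.log(1 - p_fail))
days_for_half = runs_for_half * interval_s / 86_400

# Hypothetical canary window of one day (an assumption for illustration only).
canary_runs = 86_400 // interval_s
p_canary = p_seen(canary_runs)

print(f"runs until 50% odds: {runs_for_half}")
print(f"days until 50% odds: {days_for_half:.1f}")
print(f"odds of seeing it in a one-day canary: {p_canary:.1%}")
```

Under these assumptions a one-day canary would catch the problem only a few percent of the time, which fits the "not often enough to see it hit during canary" reading.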

Now I don't know what binary pack does exactly, but if you were to replace the above process with something that compresses data, a flipped bit will corrupt a lot more data, often everything from that point forwards (whereas text or JSON is pretty self-synchronizing). And if your new code falls over completely when that happens, no more self-healing.
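A small sketch of that difference, with loud caveats: zlib is only a stand-in here since I don't know what binary pack actually does, and the record layout (`host`, `ttl`) is made up for the example. It flips one bit in a newline-delimited JSON blob and one bit in the compressed copy of the same blob, then counts how many original records are still there.

```python
import json
import zlib

# Hypothetical records; the real database format is not known from the thread.
records = [
    json.dumps({"host": f"h{i}.example", "ttl": 300}).encode()
    for i in range(100)
]
text_blob = b"\n".join(records)


def flip_bit(data: bytes, byte_index: int) -> bytes:
    """Return a copy of `data` with the lowest bit of one byte inverted."""
    damaged = bytearray(data)
    damaged[byte_index] ^= 0x01
    return bytes(damaged)


def intact_records(blob: bytes) -> int:
    """Count original records that still appear verbatim as a line in `blob`."""
    present = set(blob.split(b"\n"))
    return sum(rec in present for rec in records)


# Plain text: the damage stays inside the line (or two) that was hit.
text_intact = intact_records(flip_bit(text_blob, len(text_blob) // 2))

# Compressed (zlib as a stand-in): decompression is all-or-nothing here,
# so a failed integrity check means no records come back at all.
packed = zlib.compress(text_blob)
try:
    packed_intact = intact_records(zlib.decompress(flip_bit(packed, len(packed) // 2)))
except zlib.error:
    packed_intact = 0

print(f"intact after flip, plain text: {text_intact} of {len(records)}")
print(f"intact after flip, compressed: {packed_intact} of {len(records)}")
```

The plain-text copy should lose only the record that was hit, while the compressed copy should come back with nothing, which is the "falls over completely" case. A reader that streams the decompression could salvage the data before the damaged point, but not what comes after it.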

I can totally imagine missing an event like that during canary testing.