Hacker News
by ksmith14 2889 days ago
The Google SREs mentioned this in their book; the Chubby locking service had uptime that was so high that folks started to neglect making their own services resilient to Chubby failures: https://landing.google.com/sre/book/chapters/service-level-o...

+1 for this book. As a junior DevOps engineer this book has been super helpful.
The book is structured so that it's easy to jump around and pick which parts you want to read or skip, so it's not a very large commitment to read it.
Mine just came in the mail today. Pretty stoked.
Still, that's bad design on the clients' part. E.g., just because malloc "never" fails doesn't mean it can't fail :) so better to check its return value.
Doesn't matter. Engineering around human failure is part of the profession.
That's a beautiful way to put it. I'd read that book.
Well, I'm a Google SRE so...
Failure of malloc() might be a bad example to pick because on Linux most distros overcommit by default, so malloc generally won't fail. Instead, malloc succeeds in reserving the address space, but the RAM only gets allocated on first use. So even though malloc gave you a supposedly valid pointer rather than NULL, actually touching that memory can still get your program killed (typically by the OOM killer) if the system has run out of memory.
Other systems may be configured differently and return NULL. Relying on overcommit isn't portable, and it's bad practice not to check the return value anyway.
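To make the "check it anyway" point concrete, here is a minimal C sketch. The helper names `alloc_array` and `dup_string` are made up for illustration, not from any library. Note that under overcommit a non-NULL return only means the address space was reserved, so this check is necessary but doesn't protect you from being killed later on first use.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: allocate n elements of `size` bytes each.
 * Returns NULL on multiplication overflow or if malloc fails. */
static void *alloc_array(size_t n, size_t size)
{
    /* Reject n * size overflow before it reaches malloc. */
    if (size != 0 && n > SIZE_MAX / size) {
        return NULL;
    }
    return malloc(n * size);
}

/* Hypothetical helper: copy a string, propagating allocation failure
 * to the caller instead of assuming malloc always succeeds. */
static char *dup_string(const char *s)
{
    assert(s != NULL);              /* programming error, not a runtime failure */

    size_t len = strlen(s) + 1;     /* include the terminating NUL */
    char *copy = alloc_array(len, 1);
    if (copy == NULL) {
        return NULL;                /* caller decides how to degrade */
    }
    memcpy(copy, s, len);
    return copy;
}
```

Callers then do `char *c = dup_string(name); if (c == NULL) { /* handle it */ }` rather than dereferencing blindly.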
Is there a way to fix this/switch it off? I never got the rationale for this behaviour.
There's a sysctl: vm.overcommit_memory=2
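If you want to see what a given box is currently set to, a rough C sketch is below. It assumes a Linux system that exposes the setting at `/proc/sys/vm/overcommit_memory`; the function names are made up for illustration. The mode meanings (0 = heuristic, 1 = always, 2 = never) are from the kernel's overcommit accounting docs.

```c
#include <assert.h>
#include <stdio.h>

/* Map the numeric vm.overcommit_memory mode to a short description. */
static const char *overcommit_mode_name(int mode)
{
    switch (mode) {
    case 0:  return "heuristic overcommit (kernel default)";
    case 1:  return "always overcommit";
    case 2:  return "strict accounting, no overcommit";
    default: return "unknown";
    }
}

/* Read the mode from `path` (on Linux: /proc/sys/vm/overcommit_memory).
 * Returns the mode, or -1 if the file is missing or unparsable. */
static int read_overcommit_mode(const char *path)
{
    assert(path != NULL);

    FILE *f = fopen(path, "r");
    if (f == NULL) {
        return -1;                  /* not Linux, or /proc not mounted */
    }

    int mode = -1;
    if (fscanf(f, "%d", &mode) != 1) {
        mode = -1;                  /* file did not start with an integer */
    }
    fclose(f);
    return mode;
}
```

Usage would be something like `printf("%s\n", overcommit_mode_name(read_overcommit_mode("/proc/sys/vm/overcommit_memory")));`. Setting it at runtime is `sysctl -w vm.overcommit_memory=2` as root. In mode 2 the commit limit also depends on `vm.overcommit_ratio`, so check that before flipping it on a production machine.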

What most people don't realize is that you will get more OOMs if you disable overcommit.