by cordite
3165 days ago
The level of detail and linearity is impressive. At this scale, it seems like it may be warranted to start running reliability tests in production, in line with what Netflix does. At the end I see mention of a library with flaws. I am curious which library that is, given that I develop some projects in Elixir.
Reliability testing is definitely something we're interested in as we bring on more SRE/reliability-focused people, but it is also probably the lowest-payoff option for us right now (compared to spending engineering effort on the things we already know need work). Some of the failures we experienced in the system are related to issues we know about but haven't prioritized (read: had time for) yet.
For the library, we believe the bug is related to hackney and the fact that it uses the high priority setting for its pool process. For some reason (this is the part we're not entirely sure about and are still investigating), this high-priority process got stuck and consumed all of the scheduler time (presumably related to the earlier API degradation), breaking the distribution port and the application in a weird way. Oddly enough, the systems we run on are SMP, so in theory one rogue process should not be able to have this effect.