| HN Mirror

Thanks, we try our best with these. Past experience has shown they can be very valuable, and help everyone at the company get context on the system and how we handle failures.

Reliability testing is definitely something we're interested in as we spin up more SRE/reliability focused individuals, but also has probably the least amount of cost-benefit for us (compared to engineering effort on improving the things we know need work). Some of the failure in the system we experienced is related to issues we know about, but haven't prioritized (read; had time for) yet.

For the library, we believe the bug is related to hackney and the fact it uses the high priority setting for its pool process. For some reason (this is the part we're not entirely sure on, and still spending some time investigating) this high priority process got stuck and consumed all of the scheduler time (presumably related to the earlier API degradation), breaking the distribution port and the application in a weird way. Oddly enough the systems we run on are SMP, so in theory one rogue process should not be able to have this effect.