by maccam94 1847 days ago
FB doesn't want hardware to run at lower than rated speeds. Their tool allows them to detect when it happens and remediate the issue.

OP claims it shouldn’t be “difficult to detect (...) because the hardware is working”, since most commercially sold host controller chips generate an interrupt and report errors, unless Facebook is using something nonstandard that doesn’t.
The hardware is reporting the errors to the kernel but not crashing the system. It's "difficult to detect" because unless you are specifically monitoring for those stats, the only issue you'll see is degraded performance on an occasional machine (assuming you are watching carefully enough to even discern the performance delta). Some of the error counters are even predictive of an issue rather than something that is actively impacting performance. The FB software is basically scraping those messages and bus stats into JSON that can be consumed by their monitoring infrastructure.
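To make the "scraping bus stats into JSON" idea concrete, here is a rough sketch, not FB's actual tool. It assumes the error counters arrive as plain text with one `NAME COUNT` pair per line; the counter names, the device string, and the `parse_counters` / `to_report` helpers are all made up for illustration.

```python
import json


def parse_counters(text):
    """Parse lines of the form 'NAME COUNT' into a dict of ints.

    Lines that do not match that shape are skipped.
    """
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def to_report(device, text):
    """Build a JSON string a monitoring system could ingest.

    'device' is a hypothetical identifier; 'has_errors' flags any
    non-zero counter so a degraded machine can be alerted on.
    """
    counters = parse_counters(text)
    report = {
        "device": device,
        "counters": counters,
        "has_errors": any(count > 0 for count in counters.values()),
    }
    return json.dumps(report, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical sample input; real counter names depend on the bus.
    sample = "RxErr 3\nBadTLP 0\nTimeout 1\n"
    print(to_report("0000:03:00.0", sample))
```

The point is that a non-zero counter shows up in monitoring even when the box is still up and only slightly slower, which is exactly the case that is otherwise hard to notice.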