Hacker News
by semi-extrinsic 1240 days ago
> Other issues are due to misbehaving bad hardware that need to be identified and removed from operation.

> We are actively working on addressing those limitations this quarter.

This always boggles the mind, but I've seen similar problems several times on different HPC clusters. Hardware bugs that you just cannot seem to shake out, that are triggered just often enough to be a problem but seldom enough to be "impossible" to debug.

2 comments

Maybe someone was Screaming in the Datacenter again, disturbing nearby spinning Disk Drives...

That sounds like almost every dodgy disk drive I've encountered, and those clusters can have hundreds of them.