by will_hughes
4020 days ago
If something similar happens to you on "cloud" infrastructure, you're very limited in what you can do to diagnose or work around the problem. At a place I used to work, we had a reasonably large cluster of Windows boxes on Amazon. At random, those machines would suddenly stop accepting new TCP connections: everything would be running fine, and then half the cluster would start dropping offline. At the time, we couldn't find any other reports of this happening. It turned out to be a bug in the Xen virtual NIC driver, which wasn't running the offloaded TCP cleanup, so eventually the system couldn't accept any new connections.
Once we figured out what was happening we could pre-emptively reboot boxes, but it was a problem for us for about 6 months, iirc. There are probably dozens of these bugs affecting someone on these cloud platforms at any one time. But because you have no access to the hardware, you don't even have the option of saying "Screw it, let's just get different hardware". You're at the mercy of your cloud provider.