by jlebar
4020 days ago
To me, this sort of thing brings home the value of not running your own machines. Sure, Amazon's/Google's clouds have quirks, but it's far less likely that you're going to have to debug faulty hardware in this way. It sounds like a team of more than one person worked on this at least part-time for weeks -- how much is that worth? It's not just the cost of hiring extra people to do the work; often small companies simply can't hire enough good people -- when you do find them, do you want to squander them twiddling servers?
|
At a place I used to work, we had a reasonably large cluster of Windows boxes on Amazon. Randomly, Windows machines on Amazon would suddenly stop accepting new TCP connections.
This meant that machines would be running fine, and then half the cluster would start dropping offline. At the time this happened to us, we could find no other reports of it.
Turns out, it was some bug in the Xen virtual NIC driver: it wasn't running the offloaded TCP cleanup, so eventually the system couldn't accept any new connections. Once we figured out what was happening we could pre-emptively reboot boxes, but that was a problem for us for about 6 months iirc.
There are probably dozens of these bugs affecting someone on these cloud platforms at any one time. But because you have no access to the hardware, you don't even have the option of saying "Screw it, let's just get different hardware". You're at the mercy of your cloud provider.
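For illustration only, here is a minimal sketch of how a box in that state might be spotted from the outside: try to open a fresh TCP connection and treat a failure or timeout as "no longer accepting connections". The function names, the port, and the timeout are all hypothetical; this is not the tooling described above, just one way such a check could look.

```python
import socket


def accepts_new_connections(host, port, timeout=3.0):
    """Return True if a brand-new TCP connection to host:port succeeds.

    A refused connection, a timeout, or any other socket error counts
    as "not accepting new connections".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def hosts_to_reboot(hosts, port, timeout=3.0):
    """Return the hosts (hypothetical cluster members) that fail the probe."""
    return [h for h in hosts if not accepts_new_connections(h, port, timeout)]
```

Something like `hosts_to_reboot(["10.0.0.5", "10.0.0.6"], 3389)` (made-up addresses and port) could then feed whatever reboot mechanism is in use. Note that a probe like this only detects the symptom after it appears; rebooting before the failure would need a schedule or some other signal.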