| HN Mirror


	by Operyl 2243 days ago
	Maybe. About 3/4 of all outages get a post mortem. There's 1/4 of the time they refuse to tell us anything.

2 comments

mbreese 2243 days ago

There will have to be a post mortem on this. The convention is to be as transparent as possible as to what went wrong. This helps to let current customers know that you found the problem, and have put plans in place to make sure it doesn't happen again.

The purpose of the signalling here is twofold.

1) If convincing enough (with details), you can keep current customers from moving to a competitor.

2) It also lets new customers see how you actually handle a crisis. If you manage the crisis well enough, then you can point to this instance to prove you have the technical know-how to handle their needs.

If they don't tell anything, or aren't transparent, then they can expect a mass exodus of customers.

bigiain 2243 days ago

> then they can expect a mass exodus of customers

I wonder if that's a thing that would even cross a typical IBM-er's mind? It might just be me, but I get a very strong smell of "We're IBM! There's nowhere else for you to go!" from them...

colinbartlett 2243 days ago

Do you actually have data on that or are you conjecturing? Because I would really love to see data about that if it exists somewhere.

Operyl 2243 days ago

I'm talking from experience. Most things do get post mortems, but there's a lot of crap they also don't give us post mortems for "because customer data." It's my number 1 complaint, and I fight with managers about this all the time. We have a ton of hypervisor problems, and a lot of networking issues (generally over private network) and they tend to get very very secretive about it.

toast0 2243 days ago

I didn't use their hypervisors, but I've had a lot of experience troubleshooting their networks. They've gotten a lot better at proactive monitoring, but we used to occasionally find some private networking paths that were having trouble, and until we narrowed it down, it was hard to find. (I dunno, I guess you can't just ask all the routers if there are any ports with errors, but sure enough, when they found the right port, there was usually a huge error count, or something)

The key thing is each IP 5-tuple (peerA, peerB, protocol, portA, portB) will always take the same path over their network (most likely a different path for return packets, when A and B are switched), so in order to properly probe, you need to probe on a lot of port combos, and once you find a broken combo, you need to run MTR on those ports, so you can give them the MTR that shows the issue.
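A rough sketch of that probing loop, in Python. Everything here is an assumption for illustration: the function names, the loss threshold, and the idea that you supply your own probe callable (whatever sends one packet or opens one connection from a given source port to a given destination port between the two peers). It only shows the sweep over port combos, not how their network actually hashes flows.

```python
import itertools
from typing import Callable, Iterable, List, Tuple

PortPair = Tuple[int, int]


def port_combos(src_ports: Iterable[int], dst_ports: Iterable[int]) -> List[PortPair]:
    """Every (source port, destination port) pair to try between two fixed peers."""
    return list(itertools.product(src_ports, dst_ports))


def find_broken_combos(
    combos: Iterable[PortPair],
    probe: Callable[[int, int], bool],  # hypothetical: True if one probe got through
    attempts: int = 5,
    max_loss: float = 0.0,
) -> List[Tuple[int, int, float]]:
    """Probe each combo several times and return the ones losing more than max_loss.

    Since a given 5-tuple keeps taking the same path, a combo that fails
    repeatedly points at one specific path rather than random noise.
    """
    broken = []
    for src_port, dst_port in combos:
        failures = sum(1 for _ in range(attempts) if not probe(src_port, dst_port))
        loss = failures / attempts
        if loss > max_loss:
            broken.append((src_port, dst_port, loss))
    return broken


# Stand-in probe for demonstration only: pretends exactly one combo is on a bad path.
def fake_probe(src_port: int, dst_port: int) -> bool:
    return (src_port, dst_port) != (40001, 443)


combos = port_combos(range(40000, 40003), [80, 443])
broken = find_broken_combos(combos, fake_probe)
```

Each entry in `broken` is a (source port, destination port, loss fraction) triple, and those are the ports you would then run MTR on for the ticket.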

Or, if you can, have your internode protocol run on multiple connections and drop connections that are showing issues, and let a different customer file the tickets :)
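Something like this, as a minimal sketch. The class name, the thresholds, and the idea of keying connections by (local port, remote port) are all made up for illustration; a real internode protocol would also need to reopen connections on fresh ports after dropping one.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, Iterable, List


@dataclass
class ConnStats:
    sent: int = 0
    failed: int = 0


class MultiPathPool:
    """Track several connections to one peer and drop the ones that look unhealthy."""

    def __init__(self, conn_ids: Iterable[Hashable],
                 max_failure_rate: float = 0.2, min_samples: int = 10) -> None:
        self.stats: Dict[Hashable, ConnStats] = {cid: ConnStats() for cid in conn_ids}
        self.dropped: List[Hashable] = []
        self.max_failure_rate = max_failure_rate
        self.min_samples = min_samples

    def record(self, conn_id: Hashable, ok: bool) -> None:
        """Record one send result; drop the connection once it crosses the threshold."""
        stats = self.stats.get(conn_id)
        if stats is None:  # unknown or already dropped
            return
        stats.sent += 1
        if not ok:
            stats.failed += 1
        enough = stats.sent >= self.min_samples
        if enough and stats.failed / stats.sent > self.max_failure_rate:
            del self.stats[conn_id]
            self.dropped.append(conn_id)

    def healthy(self) -> List[Hashable]:
        """Connections still considered usable."""
        return list(self.stats)


# Two connections to the same peer, differing only in local port.
pool = MultiPathPool([(40000, 9000), (40001, 9000)])
for i in range(10):
    pool.record((40000, 9000), ok=True)
    pool.record((40001, 9000), ok=(i % 2 == 0))  # half of these sends fail
```

Because each connection has its own 5-tuple, a bad path only takes out some of them, and traffic keeps flowing over the rest.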

(email is in my profile if you want to discuss)

mbreese 2243 days ago

IBM cloud specifically or just in general?

Operyl 2243 days ago

I'm talking about IBM Cloud specifically, yes.
