Hacker News
by btilly 5812 days ago
There are lies, damned lies, and SLAs. Personally I only find an SLA useful if it is worthwhile. Most of the SLAs out there aren't. And for good reason. You should probably offer one, but like any smart company, you shouldn't make the burden on yourself too heavy.

Suppose someone doesn't respond to a page. Is it because they were too deeply asleep to hear the paging device? Because the paging device didn't work? Because some other problem kept them from working on the page remotely? Because their carrier blocked the page? Because your service broke down? Because the problems in their system kept them from sending you the information in the first place?

There are a lot of points of failure, and your service is not one of the more likely ones to break. Furthermore, if there is a dispute, whose records win? They didn't respond to a page, and your records say they never sent the page. They blame you. How do you resolve that?

Therefore I'd suggest offering an SLA, but make it something like, "If you missed a page and are convinced that it was our fault, we'll refund the last X months." From your point of view it is a no-questions-asked refund policy that carries with it the consequence that the person is not allowed to sign up for your service again. (Unless, of course, you're convinced it was your fault they didn't receive their page.) But whatever you do, be careful not to accept potential liability for something that was likely their problem.

I would also suggest that you share best practices. For instance, an important one is that companies need to provide a well-defined escalation path. Recognize that humans fail (whether because of not waking up, being in the middle of driving, etc.), so people are unreliable components that need a fall-back mechanism. Educating your clients about things like this will help them avoid problems that, in an imperfect world (i.e. the one we live in), could make them unhappy with you.
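To make the escalation-path idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: the `Responder` type, the `escalate` function, and the `page` callback (assumed to send a page, wait up to the responder's timeout, and return True if the page was acknowledged) are not taken from any real paging service.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Responder:
    name: str
    ack_timeout_s: float  # how long the page callback should wait for an ack


def escalate(chain: Sequence[Responder],
             page: Callable[[Responder], bool]) -> Optional[str]:
    """Page each responder in order and stop at the first acknowledgement.

    Returns the name of whoever acknowledged, or None if the whole
    chain was exhausted without a response.
    """
    for responder in chain:
        if page(responder):
            return responder.name
    return None


# Hypothetical chain: primary on-call, then a backup, then a manager.
chain = [
    Responder("primary", ack_timeout_s=300),
    Responder("secondary", ack_timeout_s=300),
    Responder("manager", ack_timeout_s=600),
]

# Simulate the primary sleeping through the page.
who = escalate(chain, lambda r: r.name != "primary")
print(who)
```

The point of the design is that no single human is treated as reliable: a missed acknowledgement is an expected event with a defined next step, not an exception.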

1 comments

SLAs with exceptions based on "fault" are meaningless. Either you guarantee you will keep your shit working, or you don't.

(Either way is fine, really... but arguing over "fault" is not a productive activity.)

It's not meaningless? If "working" depends on several pieces working, and only some of them are under your control, you can be in a state of "not working" without being at fault.
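A rough way to see this in numbers: if a service only works when every piece in the chain works, and the pieces fail independently (a simplifying assumption), the end-to-end availability is the product of the individual availabilities. The component names and figures below are made up for illustration, not measurements from anyone in this thread.

```python
import math


def end_to_end_availability(components: dict) -> float:
    """Availability of a chain where every component must be up.

    Assumes failures are independent, so availabilities multiply.
    """
    return math.prod(components.values())


# Illustrative numbers only.
components = {
    "provider_service": 0.999,   # the part the provider controls
    "upstream_routing": 0.995,   # someone else's network
    "customer_side": 0.990,      # the customer's own systems
}

total = end_to_end_availability(components)
print(f"end-to-end: {total:.4f}")
print(f"provider alone: {components['provider_service']:.4f}")
```

The chain as a whole is always less available than the provider's own piece, which is why "not working" and "at fault" can come apart.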

I've had a server go down for a large group of users because of a misconfigured routing table between them and the server. If we'd had an expensive SLA, there would have been significant "what the heck is it we're paying for, then?" discontent.

right. my point is that if you are selling the customer a service, and you say 'I will get you network connectivity' and then, for reasons outside your control, you don't get them network connectivity, it doesn't make much difference to the customer if the network is broken because you did something dumb or if the network is broken because you are getting DDoS'd from China. the point is that the network is broken.

last month I paid out almost fourteen grand in SLA credits because I didn't stop a DDoS within my allowed 0.5% downtime. Was it my fault I got DDoS'd? no. However, I was the only one in a position to do something about it. (and really, if I wasn't tired and generally an idiot, we would have been down for an hour rather than 8.)
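For a sense of scale, the arithmetic behind a 0.5% downtime allowance can be sketched as below. The 30-day measurement period is an assumption (the comment doesn't say how the period is defined), and the function names are made up for illustration.

```python
def allowed_downtime_hours(period_days: float, allowed_fraction: float) -> float:
    """Hours of downtime permitted in one measurement period."""
    return period_days * 24 * allowed_fraction


def sla_breached(outage_hours: float,
                 period_days: float = 30,        # assumed period length
                 allowed_fraction: float = 0.005) -> bool:
    """True if the outage alone exceeds the period's downtime budget."""
    return outage_hours > allowed_downtime_hours(period_days, allowed_fraction)


budget = allowed_downtime_hours(30, 0.005)
print(f"budget: {budget:.1f} hours")        # roughly 3.6 hours in a 30-day period
print("8h outage breaches:", sla_breached(8))
print("1h outage breaches:", sla_breached(1))
```

Under that assumption, the budget is about 3.6 hours, so an 8-hour outage blows through it while a 1-hour outage would have stayed inside.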

You do need clear lines, though. if you need connectivity from point A to point B, that's easy, I can guarantee that. But defining connectivity to 'the internet' is harder. there are cases where I've got good connectivity to most places, but you can't get to some ISP in Dallas, because they've hoarked up the routing table.

Right now, I play that sort of thing by ear. If only one customer is having the problem, I try to figure out where it is and if I can't figure it out, it's not that big of a deal to give them a credit. If many customers are having the problem, well, then I have a problem, and really, it's my job to figure out where that problem is and to work around it... even if that problem is a misconfigured router at some other ISP. I mean, really, what is the customer going to do about that sort of thing?

this is the point of having an SLA; it aligns the interests of the service provider with the interests of the customer.