by Dobbs 3480 days ago
From an engineering point of view this is really cool, but as an ex-sysadmin I feel that I need to reiterate and emphasize something that is alluded to in the second paragraph.

Too many things can go wrong and you are all around better off outsourcing this to something like Pingdom. You don't have sufficient levels of reliability, and you aren't dual-homed across Twilio and another phone system. Maybe the cause of your outage is that AWS is having issues. Now your site and your monitoring are both down.

Much better to outsource to people who obsess over doing this right and making sure they are properly redundant.

5 comments

> outsource to people who obsess over doing this right

Completely agree! I often have to fight that "I could just build that myself" mentality, which glosses over the points you made so well.

This and that!

It's the same as "Twitter clone" with just posting messages with 140 char limit and "build a blog in 15 minutes".

Alerting on a downed website is sorta like a glacier: there's so much under the surface that if you only look at the surface, you're missing most of it.

1. Multiple locations
2. Multiple check intervals
3. SMS/email provider switch on fail
4. Auto recovery of your checkers
5. Multiple providers with a single storage
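
To make point 3 concrete, here is a minimal sketch of "provider switch on fail". It assumes each provider is wrapped in a plain callable that raises on failure; `send_with_failover` and the notifier callables are hypothetical names for illustration, not any real SMS or email API.

```python
from typing import Callable, Sequence

# A notifier is any callable that delivers a message and raises on failure.
# Real ones would wrap an SMS or email provider; that part is assumed here.
Notifier = Callable[[str], None]


def send_with_failover(message: str, notifiers: Sequence[Notifier]) -> int:
    """Try each notifier in order and return the index of the first that works.

    Raises RuntimeError if every notifier fails, so the caller can escalate.
    """
    errors = []
    for index, notify in enumerate(notifiers):
        try:
            notify(message)
            return index
        except Exception as exc:  # any provider error triggers the switch
            errors.append(exc)
    raise RuntimeError(f"all {len(notifiers)} notifiers failed: {errors!r}")
```

Even this toy version leaves the other four points untouched, which is kind of the point of the list.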

> Now your site and your monitoring are both down. Much better to outsource to people who obsess over doing this right and making sure they are properly redundant.

You make valid points about redundancy and levels of reliability but keep in mind that even Pingdom can go down: http://royal.pingdom.com/2016/10/24/ddos-attack-affects-ping...

Chances are that pingdom won't be down at the same time that your site is down.

Diversify to avoid cascading failures ;)

With your own solution you will likely encounter the same problems that pingdom faced including this one. The benefit of a service like pingdom is that they already solved those problems for you or if they haven't you don't have to waste time solving them yourself. It's not very efficient if everyone solves the same problems over and over again.
Use 2 or more providers. Signing up takes a minute or two and there are free alternatives.
My favorite issue recently came up with a Django app of mine which was set up to email me when a request errors out. Turns out, when I switched which server it ran on I misconfigured the email settings, and one of the errors was caused by the inability to send an email. Thankfully it only took a few days to figure this out.
We’ve had issues with Pingdom at work. We don’t use them ourselves, but we host web sites, and some customer of ours used Pingdom to monitor their web site hosted on our servers. The customer would complain to us about downtime reported by Pingdom, but we would read the logs and find everything OK, with multiple successful accesses from other people during the time which Pingdom reported our customer’s site as being down. A huge pain.
Don't services like Pingdom support multiple ping locations? If all of those fail, there's a very high chance there's an actual problem, if not with your server then with your (ISP's) connectivity.
The question is what customers of monitoring systems expect from the monitoring. Does Pingdom explain what a failure means, or are they only providing data, leaving it up to the customer to interpret that data?

Multiple ping locations are helpful in bringing more data points, but they don't address the problem of explaining what the data means. For example, Pingdom could provide triangulation of the failure if fault identification were part of the business model of monitoring.
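
As a rough illustration of the difference between raw data points and an interpretation, here is a sketch that turns per-location check results into a verdict. This is not how Pingdom works as far as I know; `classify` and the location names are made up for the example.

```python
from typing import Mapping


def classify(results: Mapping[str, bool]) -> str:
    """Interpret per-location check results (True means the check succeeded).

    Returns "up" if every location succeeded, "down" if every location
    failed, and "partial" if only some did, which hints at a path or
    routing problem rather than a dead server.
    """
    if not results:
        raise ValueError("need at least one check result")
    failed = [location for location, ok in results.items() if not ok]
    if not failed:
        return "up"
    if len(failed) == len(results):
        return "down"
    return "partial"
```

With something like `{"stockholm": False, "dallas": True}` this returns "partial", which is a more honest thing to tell a customer than "down".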

I would describe the criticism of pingdom as a failure of expectations. Pingdom is not a security service, a monitoring service, or fault identification service. They are a single test, and the data you get back is useless unless interpreted and verified.

If our ISP was down, we would not have had successful accesses from other people at the same time. If some transit ISP was down somewhere between us and Pingdom, well, that’s the Internet for you, eh? Regardless, Pingdom would report us as down, even though we weren’t at fault.
Yes you were down for some of your users. If that's ok for you that's fine. But if I were you I would be calling my ISP and trying to sort out why customers from location X can't access but customers from location Y can.

If you're providing a service to your users, and they say that the service is down according to Pingdom, you should be looking into it, not just saying "Works on my machine".

Why should we be the ones to look into it? It was a random intermittent short-duration fault in the middle of the Internet, at some unknown place on the then-current path between us and Pingdom. Why should not Pingdom be at least equally as obligated to look into it? After all, they’re the ones actually using the failing connection, in order to monitor our and others’ services. But no, Pingdom simply report us as being down, and leave the hard part to us; i.e. the part where we have to explain to our customers that the Pingdom report is actually provably incorrect.

I mean, what qualifies as “being up”? If some random link in the middle of the Internet goes down, and you suddenly, for 30 seconds, are unreachable for the few hundred people going through that exact link because it happens to be the best path between those people and your server, can they claim that you have failed to provide adequate uptime? If such a fault happens, are you then responsible to troubleshoot it? I say no. The Internet is the ISP’s responsibility, and the only faults actually meaningful to report to your ISP are the repeatable or long-lasting ones. Small stuff like this is not worth anybody’s time (except ISPs) to go digging into.

Well if you're not providing a service to others, then you shouldn't be the ones to look into it. But if you're providing a service to users and they tell you it's down then you should. It might be that your ISP has a misconfigured route that is flapping and sometimes causes errors in some locations. Or a netmask is wrong somewhere and certain IP addresses can't be accessed. It might not be a temporary thing. And if it's your ISP's fault they might be able to fix it.

You seem to think that you have to investigate the issues. On the contrary, you bump it up to your ISP to investigate. If your ISP is regularly having these issues then it might be time to change ISPs to one with a better peering agreement.

We've seen the same with Pingdom. ISPs are (mostly) multi-homed and Pingdom might use just one (affected) route to the ISP in question and then fail badly.

If Pingdom can't get to your site, it's highly likely your users can't either.

That was not the case with us; Pingdom would report short outages, like a few seconds here, a couple of minutes there, and only a handful of occurrences for the whole report duration (IIRC).
Yep, plus most engineering time is worth at minimum $60+/h, which would pay for a year or more with most of these services.
On the other hand, it's 'set up once and it just keeps chugging along', and isn't Yet Another SaaS To Manage.

Also, if you want a 'proper' ops alerting SaaS, you're looking at something along the lines of $50/user/mo or $15/server/mo, neither of which is trivial.
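
Putting the numbers from this thread side by side (the $60/h rate and the $50/user/mo and $15/server/mo prices), a back-of-the-envelope sketch. The function name is made up, and it ignores ongoing maintenance time entirely.

```python
def hours_to_match_a_year(monthly_price: float, units: int = 1,
                          hourly_rate: float = 60.0) -> float:
    """Engineering hours whose cost equals one year of a monthly-priced service."""
    return monthly_price * units * 12 / hourly_rate


print(hours_to_match_a_year(50))  # one user at $50/mo -> 10.0 hours
print(hours_to_match_a_year(15))  # one server at $15/mo -> 3.0 hours
```

So per user or per server the service is cheap next to engineering time, but the total scales with however many users or servers you have.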

Yeah assuming nothing falls apart with the custom implementation maintenance-wise. Programmers have a hard time focusing on their real goals though, we often re-implement things that really aren't worth the time or money.