Hacker News
by treflop 752 days ago
It’s possible to figure out exactly what failed if you know how it all works.

But writing a tool that provides a useful description to the user is nearly impossible: no two setups are the same, it’s not possible to know whether something is intentional, and it can be dangerous to assume one of the common causes and suggest a completely wrong answer to the user.

For example, let’s say you can’t connect to a website because the DNS server isn’t responding and the host isn’t responding. You could tell the user that something is probably misconfigured at your router or your ISP is having some issues.

However, it turns out that the actual reason was that your VPN client updated your local routing tables and DNS server but failed to remove the changes when you quit the client. How is a troubleshooter supposed to know that the settings were temporarily changed versus it being the permanent ones?

Once you start trying to write a troubleshooter that can identify the actual cause, you realize that it’s very difficult due to the complexity and variation. At best you can write something that usually spits out a correct answer but sometimes suggests something totally wrong and leads people down a completely wrong path.
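To make the ambiguity concrete, here is a minimal sketch of a rule-based troubleshooter built only from the example above. The symptom names (`dns_timeout`, `host_unreachable`) and the `diagnose` helper are hypothetical, made up for illustration. The point is that one set of symptoms maps to several candidate causes, so a tool that reports only the first one would be wrong in the VPN case.

```python
from typing import Dict, FrozenSet, Iterable, List

# Hypothetical symptom names and candidate causes, taken from the example
# in the comment above. The ordering is an assumption: "most common first".
CANDIDATES: Dict[FrozenSet[str], List[str]] = {
    frozenset({"dns_timeout", "host_unreachable"}): [
        "router misconfigured",
        "ISP outage",
        "stale routes and DNS settings left behind by a VPN client",
    ],
}


def diagnose(symptoms: Iterable[str]) -> List[str]:
    """Return every candidate cause for the observed symptoms.

    An empty list means the rule table has nothing to say, which is
    different from "nothing is wrong".
    """
    return CANDIDATES.get(frozenset(symptoms), [])


causes = diagnose(["dns_timeout", "host_unreachable"])
# Reporting only causes[0] would blame the router even when the VPN
# client is at fault; the symptoms alone cannot tell the cases apart.
print(causes)
```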

1 comments

If Google dedicated 10 engineers full time to this problem for 3 years, could they solve it?
I work for an acquired startup that tried to solve this problem.

It’s been around 8 years and we’re up to 50 or so people. I’d say we are okay at it.

We haven’t gotten fundamentally better recently; it’s more like there is some asymptote on how much you can really tell with a given amount of insight into the systems between source and destination.

The only real progress we’ve made has been integrating with more and more sources of information about the state of the network.

Can you share the name of the company or more product details? Curious to know more about what solutions in this space look like
Yes, and they partially have. Browsers are great at telling you where the chain has failed or been cut, though some error messages seem to be intentionally uninformative, as the information provided would be meaningless to your average user.

That said, from an enthusiast perspective, running traceroute to the nearest Google service (1e100.net for example) will already give you a huge tip on where things went wrong.

I regularly run `mtr 1.1` to monitor network conditions. One of its display modes gives you a 3D view: x-axis is time, y-axis is the hops, and each cell’s colour and character indicates how long the ping took (or if it got no response). This is frequently very valuable for identifying where a problem is, which is generally one of these three: between computer and router, router and ISP, or ISP and public internet. It can also show where packet loss or latency jumps are occurring, and patterns where something goes wrong for a few seconds, so that you can determine where the problem is (this is where the time axis is crucial).

One thing that becomes apparent when you monitor diverse ISPs and endpoints this way is the inconsistency: in a normally-functioning situation, although most hops will have 0% loss, some will have absolutely any value from 0%–100%. The network I’m on at present has ten hops from _gateway to one.one.one.one; hop five is 100% loss, hop six varies around 40–50% loss, hop seven is about 60–62% loss, the rest are all 0% loss. mtr does host name lookup as well, which can be a little bit useful for figuring out what’s probably local, probably ISP and probably public internet, but the boundaries are often a bit fuzzy.

mtr: <https://en.wikipedia.org/wiki/MTR_(software)>

1.1: short spelling of 1.0.0.1, the second address for Cloudflare’s 1.1.1.1 DNS server.
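For anyone wondering why `1.1` turns into 1.0.0.1: under the classic `inet_aton` convention, the leading parts are single bytes and the last part fills all the remaining bytes. Here is a minimal sketch of that rule. The function name `expand_short_ipv4` is made up for illustration, and it handles decimal parts only (the real `inet_aton` also accepts octal and hex).

```python
def expand_short_ipv4(text: str) -> str:
    """Expand an inet_aton-style short IPv4 address to dotted-quad form.

    Sketch only: decimal parts, no octal or hex handling.
    """
    parts = text.split(".")
    if not 1 <= len(parts) <= 4:
        raise ValueError(f"expected 1 to 4 parts, got {len(parts)}")
    *leading, last = [int(part, 10) for part in parts]

    # Each leading part is exactly one byte.
    if any(not 0 <= number <= 255 for number in leading):
        raise ValueError("leading parts must be in 0..255")

    # The last part fills however many bytes are left over.
    remaining_bytes = 4 - len(leading)
    if not 0 <= last < 256 ** remaining_bytes:
        raise ValueError("last part does not fit in the remaining bytes")

    value = 0
    for number in leading:
        value = (value << 8) | number
    value = (value << (8 * remaining_bytes)) | last

    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))


print(expand_short_ipv4("1.1"))    # 1.0.0.1
print(expand_short_ipv4("127.1"))  # 127.0.0.1
```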

You can switch between the display modes with the d key, or start in this mode with MTR_OPTIONS=--displaymode=2 in the environment (which is how I do it, as it’s almost always what I want; if it weren’t, I’d probably make some kind of alias for `mtr --displaymode=2 1.1` instead).

> some will have absolutely any value from 0%–100%.

Seeing packet loss in mtr is not entirely indicative of the health of the host. Some public servers filter out ICMP altogether, and others add a firewall traffic-shaping limit on the number of pings they reply to. You might be seeing that.
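A common rule of thumb that follows from this: loss at an intermediate hop that does not carry through to later hops is probably that hop declining to answer, not packets actually being dropped. Below is a rough sketch of that heuristic, using loss figures approximating the ten-hop example quoted above. The function name, labels, and the 1% threshold are assumptions for illustration, not anything mtr itself reports.

```python
from typing import List

FILTERED = "likely ICMP filtering or rate limiting"
REAL = "possible real loss"
OK = "ok"


def label_hops(loss_by_hop: List[float], threshold: float = 1.0) -> List[str]:
    """Label each hop's loss percentage using the carry-through heuristic.

    A lossy hop is treated as harmless if any later hop answers with loss
    at or below the threshold, since probes to that later hop had to pass
    through the lossy one. This is a heuristic, not a diagnosis.
    """
    labels = []
    for index, loss in enumerate(loss_by_hop):
        later = loss_by_hop[index + 1:]
        if loss <= threshold:
            labels.append(OK)
        elif later and min(later) <= threshold:
            labels.append(FILTERED)
        else:
            labels.append(REAL)
    return labels


# Approximate figures from the example: hop five 100%, hop six around 45%,
# hop seven around 61%, everything else 0%.
example = [0, 0, 0, 0, 100, 45, 61, 0, 0, 0]
for hop, label in enumerate(label_hops(example), start=1):
    print(hop, label)
```

With these numbers, hops five to seven get the filtering label and the destination is clean. If the loss persisted all the way to the last hop, the sketch would flag it as possibly real instead.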

As long as you only ever visit Google web properties, yes.
Short answer: No.