
by jessriedel 751 days ago

Tangential question from a layman: when I lose access to a particular website, or the internet as a whole, why is it so hard to tell where in the chain the failure is occurring? Like it’s often unclear whether

* I’ve got a network misconfiguration on my local machine;

* My wifi connection to the router is down;

* The cable between my router and ISP is cut;

* My ISP is having large scale issues; or

* The website I’m trying to reach is down.

I’ve been given the vague impression that it has something to do with a non-deterministic path by which requests are routed, but this seems unconvincing. If some link on the path breaks, why doesn’t the last good link send a message backward that says “Your message made it to me, but I tried to send it the next step and it failed there.”

9 comments

treflop 751 days ago

It’s possible to figure out exactly what failed if you know how it all works.

But writing a tool that provides a useful description to the user is near impossible: no two setups are the same, it’s not possible to know whether something is intentional, and it can be dangerous to assume one of the common causes and suggest a completely wrong answer to the user.

For example, let’s say you can’t connect to a website because the DNS server isn’t responding and the host isn’t responding. You could tell the user that something is probably misconfigured at your router or your ISP is having some issues.

However, it turns out that the actual reason was that your VPN client updated your local routing tables and DNS server but failed to remove the changes when you quit the client. How is a troubleshooter supposed to know that the settings were temporarily changed versus it being the permanent ones?

Once you start trying to write a troubleshooter that can identify the actual cause, you realize that it’s very difficult due to the complexity and variation. At best you can write something that usually spits out a correct answer but also sometimes suggests something totally wrong and leads people down a completely wrong path.

jessriedel 751 days ago

If Google dedicated 10 engineers full time to this problem for 3 years, could they solve it?

avoid3d 750 days ago

I work for an acquired startup that tried to solve this problem.

It’s been around 8 years and we’re up to 50 or so people. I’d say we are okay at it.

We haven’t gotten fundamentally better over time recently, it’s more like there is some asymptote of how much you can really tell with a certain amount of insight into the systems between source and destination.

The only real progress we’ve made has been integrating with more and more sources of information about the state of the network.

reasonabl_human 750 days ago

Can you share the name of the company or more product details? Curious to know more about what solutions in this space look like

awesomeMilou 751 days ago

Yes, and they partially have. Browsers are great at telling you where the chain has failed or been cut, though some error messages seem to be intentionally uninformative, as the information provided would be meaningless to your average user.

That said, from an enthusiast perspective, running traceroute to the nearest google service (1e100.net for example) will already give you a huge tip on where things went wrong.

chrismorgan 751 days ago

I regularly run `mtr 1.1` to monitor network conditions. One of its display modes gives you a 3D view: x-axis is time, y-axis is the hops, and each cell’s colour and character indicates how long the ping took (or if it got no response). This is frequently very valuable for identifying where a problem is, which is generally one of these three: between computer and router, router and ISP, ISP and public internet. It can also show where packet loss or latency jumps are occurring, and patterns where something goes wrong for a few seconds so that you can determine where the problem is (this is where the time axis is crucial).

One thing that becomes apparent when you monitor diverse ISPs and endpoints this way is the inconsistency: in a normally-functioning situation, although most hops will have 0% loss, some will have absolutely any value from 0%–100%. The network I’m on at present has ten hops from _gateway to one.one.one.one; hop five is 100% loss, hop six varies around 40–50% loss, hop seven is about 60–62% loss, the rest are all 0% loss. It does host name lookup as well which can be a little bit useful for figuring out what’s probably local, probably ISP and probably public internet, but the boundaries are often a bit fuzzy.

mtr: <https://en.wikipedia.org/wiki/MTR_(software)>

1.1: short spelling of 1.0.0.1, the second address for Cloudflare’s 1.1.1.1 DNS server.

You can switch between the display modes with the d key, or start in this mode with MTR_OPTIONS=--displaymode=2 in the environment (which is how I do it, as it’s almost always what I want; if it weren’t, I’d probably make some kind of alias for `mtr --displaymode=2 1.1` instead).
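A small sketch of why `1.1` reaches 1.0.0.1: the classic `inet_aton` parser accepts addresses with fewer than four parts and treats the last part as filling all the remaining bytes. The helper name `expand_ipv4` is made up for this illustration.

```python
import socket

def expand_ipv4(shorthand: str) -> str:
    """Expand an inet_aton-style shorthand into dotted-quad form."""
    # inet_aton accepts forms with fewer than three dots; the last
    # number is spread over the remaining bytes of the address.
    packed = socket.inet_aton(shorthand)
    return socket.inet_ntoa(packed)

print(expand_ipv4("1.1"))    # 1.0.0.1
print(expand_ipv4("127.1"))  # 127.0.0.1
```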

mariusor 750 days ago

> some will have absolutely any value from 0%–100%.

Seeing packet loss in mtr is not entirely indicative of the health of the host. Some public servers filter out ICMP altogether, and others use a firewall traffic-shaping limit on the number of pings they reply to. You might be seeing that.
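One way to read per-hop loss with that caveat in mind is to ignore loss that does not carry through to the final hop. The sketch below applies that heuristic to a list of loss percentages; the heuristic itself, the function name `first_persistent_loss`, and the 1% threshold are assumptions for illustration, not something mtr does for you.

```python
from typing import Optional, Sequence

def first_persistent_loss(loss_by_hop: Sequence[float],
                          threshold: float = 1.0) -> Optional[int]:
    """Return the 1-based hop where loss starts and continues to the last hop.

    Returns None when the final hop shows no loss above the threshold,
    which suggests earlier loss was only rate-limited or filtered replies.
    """
    start = None
    for hop, loss in enumerate(loss_by_hop, start=1):
        if loss > threshold:
            if start is None:
                start = hop
        else:
            # A clean later hop means the earlier loss did not carry through.
            start = None
    return start

# Loss figures similar to the ten-hop example quoted above.
print(first_persistent_loss([0, 0, 0, 0, 100, 45, 61, 0, 0, 0]))  # None
print(first_persistent_loss([0, 0, 0, 20, 25, 22]))               # 4
```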

otabdeveloper4 751 days ago

As long as you only ever visit Google web properties, yes.

evilDagmar 750 days ago

Short answer: No.

nurple 751 days ago

If ICMP is allowed into your network, your machine will most likely receive a Destination Unreachable response from the host that can't forward the packet further.

Your application won't see the ICMP message unless you configure the socket to report them (these are considered "transient" errors). On Linux this is done via the socket option IP_RECVERR.

ETA: there's not a ton of value collecting errors at this layer when you're working at L7. The errors that _do_ get surfaced for DU at your layer will be appropriate for the failure handling logic you'll inevitably have already. In this case I think it'd be a timeout, as other layers implement retries in the face of unreachable destinations.

I found these RFCs helpful re: how the TCP layer handles ICMP errors: https://www.rfc-editor.org/rfc/rfc1122#page-103

Section 4.2.3.9:

> Since these Unreachable messages indicate soft error conditions, TCP MUST NOT abort the connection, and it SHOULD make the information available to the application.

> DISCUSSION: TCP could report the soft error condition to the application layer with an upcall to the ERROR_REPORT routine, or it could merely note the message and report it to the application only when and if the TCP connection times out.

This one gets into the nitty gritty of how the stacks interact in order to study ICMP as vector for TCP attacks.

https://www.rfc-editor.org/rfc/rfc5927
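A minimal sketch of turning on IP_RECVERR from Python, assuming Linux. The fallback value 11 is the Linux constant from `<linux/in.h>`, used only if the running Python build does not export `socket.IP_RECVERR`; the helper name `enable_icmp_error_queue` is hypothetical.

```python
import socket
import sys

# Linux value from <linux/in.h>; some Python builds may not export the constant.
IP_RECVERR = getattr(socket, "IP_RECVERR", 11)

def enable_icmp_error_queue(sock: socket.socket) -> bool:
    """Ask the kernel to queue ICMP errors for this socket (Linux only).

    Returns True if the option was set, False if it is unsupported here.
    """
    if not sys.platform.startswith("linux"):
        return False  # the numeric option means something else on other systems
    try:
        sock.setsockopt(socket.IPPROTO_IP, IP_RECVERR, 1)
    except OSError:
        return False
    return True

with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as udp:
    print("IP_RECVERR enabled:", enable_icmp_error_queue(udp))
```

With the option set, queued errors are read on Linux with `recvmsg` and the `MSG_ERRQUEUE` flag; parsing the returned ancillary data is left out of this sketch.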

cancerhacker 751 days ago

The browser reports the error closest to what it was doing at the time - host not found? Well, the network was reliable enough to reach a DNS server that reported there is no address for that name. But if the DNS server itself can’t be reached, it’s some sort of network error between you and that server. The typical way to diagnose that kind of problem is to perform all the steps yourself - can I ping the DNS server address? Can I resolve this host with that DNS server? What about a different DNS server, maybe that particular name is being excluded because of corporate policy. The command line tools ping, traceroute and dig are useful if you want to get into it.
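The "perform the steps yourself" idea can be sketched in a few lines: resolve the name first, then try a TCP connection, and report the first stage that fails. The function name `diagnose`, the default port 443, and the message wording are assumptions for illustration; it only uses the system resolver, so it cannot compare different DNS servers the way dig can.

```python
import socket
from typing import Callable

def diagnose(host: str, port: int = 443,
             resolve: Callable = socket.getaddrinfo,
             connect: Callable = socket.create_connection,
             timeout: float = 3.0) -> str:
    """Report the first stage (DNS, then TCP connect) that fails for host."""
    try:
        infos = resolve(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return f"dns: could not resolve {host} ({exc})"
    address = infos[0][4][0]  # first resolved address
    try:
        conn = connect((address, port), timeout=timeout)
    except OSError as exc:  # covers refused, unreachable and timed out
        return f"connect: resolved {host} to {address} but could not connect ({exc})"
    conn.close()
    return f"ok: {host} resolved to {address} and accepted a connection"

# Example call (needs network access):
#   print(diagnose("example.com"))
```

The `resolve` and `connect` parameters exist only so the stages can be swapped out when experimenting without a network.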

itscrush 750 days ago

Much of this problem space I've solved with running MTR to the destination when troubleshooting to see each hop's detail.

It's like ping + traceroute in a live running session with each hop broken down.

It's quite consistent: I'm often the first to notice a node down on the Xfinity network, and in the same mtr run I can see that my network is good at least as far as my modem. It also shows when a hop beyond my ISP adds 100s of ms of latency, which I haven't seen other tools show as well as MTR does.

Won't solve everything, but might be worth your checking in your case as it breaks down per-hop providing latency for each.

AlienRobot 751 days ago

How are you trying to tell that?

If a web browser can't access a URL, it won't tell you exactly why, because there's a chance it diagnoses the reason wrong and most users would be confused by that. I assume most diagnosis tools work the same way. You need to make assumptions about how the OS, hardware, and network are configured to be able to say "the problem is here."

For example, when you access a website, the first thing that needs to be done is check a domain name server (DNS) to get the IP address of the web server. But where does the web browser get the DNS IPs from? You can configure it in the browser. Or in the OS. Or in your router. Or in your modem. And if you don't, it gets them from the DHCP server the router connects to, which could be your ISP's DHCP server (then you get your ISP's default DNS) or it could also be some other router in an organization's network.
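As a small illustration of one of those places, here is a sketch that lists the DNS servers named in a Linux-style `/etc/resolv.conf`. That file format is an assumption about the system; the function name and the sample contents are made up.

```python
from typing import List

def nameservers_from_resolv_conf(text: str) -> List[str]:
    """Return the addresses on 'nameserver' lines of resolv.conf-style text."""
    servers = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers

# Hypothetical file contents, as a DHCP client might write them.
sample = """\
# Generated by a DHCP client (hypothetical example)
nameserver 192.168.1.1
nameserver 1.1.1.1
search lan
"""
print(nameservers_from_resolv_conf(sample))  # ['192.168.1.1', '1.1.1.1']
```

Even this only shows what the OS was told. On machines that run a local stub resolver the file may list just a loopback address, and it says nothing about whether the entries came from DHCP, a VPN client, or manual configuration.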

If the DNS seems wrong it's easy to tell the IP is wrong but it gets hard to say where that IP came from.

Even SSL could be a problem with the server having the wrong certificates or it could be your computer having the wrong certificates.

arccy 751 days ago

http(s) is built on top of multiple layers (HTTP, TLS, TCP, Ethernet...). A broken link in the lower layers can't really be presented as a higher level message (because it has no access to it).

harry_ord 751 days ago

Not a network person, only played with traceroute a long time ago, but I'm pretty sure that only really happens if you explicitly ask for information about all the middlemen.

Most of the time a lot of software kinda doesn't care about what's happening just if it can do what it's told.

For Websites you often get more informative errors like 404, 500 or something else.

recursive 751 days ago

If you're getting a status code like 404 or 500, it means there's no problem between you and the web server. The status codes come from the server. The exception is when you get a gateway/reverse proxy error. Usually 503 I think. That means the web server is down, but there's another server in front of it reporting that it's down.

harry_ord 751 days ago

True, I thought of those as they're just more informative about why you're not getting what you're looking for.

YZF 751 days ago

502 Bad Gateway.
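The distinction in this sub-thread can be written down as a rough lookup: any HTTP status at all means something answered over the network, and a few codes hint at a proxy in front of the real server. The labels and the function name `classify_status` are assumptions for illustration. Note that 503 can be sent by the origin server itself as well as by a proxy.

```python
def classify_status(status: int) -> str:
    """Give a rough hint about where a failure sits, based on HTTP status."""
    if status in (502, 504):
        # Bad Gateway / Gateway Timeout: a gateway or proxy answered,
        # but could not get a valid response from the server behind it.
        return "gateway"
    if status == 503:
        # Service Unavailable: may come from a proxy or from the server itself.
        return "unavailable"
    if 500 <= status <= 599:
        return "server-error"   # the server was reached and failed internally
    if 400 <= status <= 499:
        return "client-error"   # the server was reached and rejected the request
    return "reached"            # the server was reached and answered

for code in (200, 404, 500, 502, 503):
    print(code, classify_status(code))
```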

YZF 751 days ago

For most people, most issues would be in their home network. So that's a good first guess for any connectivity problem. Rarely it would be somewhere between your home and the ISP. If it's a small rural ISP then it might be ISP->Internet, though I'd think that's rare. Most large scale ISPs have enough redundancy and capacity.

As someone else mentioned ICMP addresses certain classes of failures if enabled but I think the historical reason is more along the lines of the Internet was meant to run over lossy connections. For example, when a certain link is saturated routers will just start dropping packets. Reporting each dropped packet back to the sender is just not a good idea, it adds load to a system already potentially operating at capacity. TCP assumes packets can get lost and retransmits them. When a link goes down routing protocols will potentially send those retransmitted packets over a different link/path. I.e. there's no real concept of "connection down" other than the application layer or TCP eventually giving up (which can take a very long time). The kind of ICMP message that will immediately terminate a connection is when the server machine doesn't have anything listening on the destination port.
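The difference between an immediate refusal and a silent path shows up directly in a connection attempt. The sketch below sorts a TCP connect into four outcomes; the function name `probe` and the labels are made up. For TCP, a closed port on a reachable host typically surfaces as `ConnectionRefusedError` right away, while a dropped path only shows up once the timeout expires.

```python
import socket

def probe(host: str, port: int, timeout: float = 2.0) -> str:
    """Try one TCP connection and classify how it ended."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"      # the host answered: nothing is listening there
    except socket.timeout:
        return "timeout"      # no answer at all within the timeout
    except OSError:
        return "unreachable"  # e.g. no route to host, network unreachable

# Example call (needs network access):
#   print(probe("example.com", 443))
```

A "timeout" result is the ambiguous one: it cannot tell a dead server from a broken link or a firewall that drops packets silently, which is the point made above.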

boffinAudio 751 days ago

Cyclomatic Complexity is why your Operating System can't do this for you.

https://en.wikipedia.org/wiki/Cyclomatic_complexity

There are so many different paths for an error case to follow.

You can of course debug this by reducing the complexity - for example, by watching one of the links in the chain (say, DNS) and seeing if it is failing - but this is the realm of network engineers who get paid mightily to get through this cyclomatic complexity and work at the relevant layers, all the way down to the atoms in the pipe ..

>If some link on the path breaks, why doesn’t the last good link send a message backward that says “Your message made it to me, but I tried to send it the next step and it failed there.”

In fact, the links all do this, but there is simply no provision in your OS - no fancy GUI, perhaps - that allows you to fully understand this without getting overwhelmed by the cyclomatic complexity. Tools exist, and once you learn to use them to tame the complexity - congrats, you're now worth $300k/yr and can go work in San Francisco .. /s ;)