by mandarg 3260 days ago
Here's my attempt at an explanation, after reading the article.

Most companies' networks have edge routers (which sit at the points where they connect to other networks) and core routers (which manage the flow of traffic inside the network). All these routers basically use a standard protocol called BGP (Border Gateway Protocol), which is defined by RFC 4271.

However, BGP was designed from the point of view of individual machines that make routing decisions and announce routes to each other, and that collectively make up the whole Internet. This helps the Internet as a whole be quite resilient – if one network goes down, there are still ways to route traffic through to other networks. Also, since the protocol is standard, you can swap out one vendor's gear for another at will (in theory anyway) as long as you know how to configure it correctly.

But this leads to some inefficiency – for instance, it is very hard to say that a path with fewer hops will lead to lower latency. What Google seems to have done is to make their edge routers into one single "intelligent" network, where the edge routers don't make routing decisions on their own, but feed their data into a central server. This central server can then say stuff like "My peering router in NYC seems to be under heavy load, let me redirect some of my traffic to NYC destinations through the NJ datacenter instead", or something to that effect; while still doing the correct BGP announcements from the point of view of Level3 or whoever is peering.

In short, they built their internal network from the ground up, since they are so big that they can afford to build custom routing gear instead of using the standard, off-the-shelf setup that a small or medium-sized company uses. The network consisting of their custom edge routers (all the green blobs) together is called Espresso and is represented by a light grey circle.
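To make the "central server" idea above a bit more concrete, here is a minimal sketch of the kind of decision it might make. This is only my illustration, not Google's actual algorithm: the `EgressReport` fields, the `choose_egress` function, the 0.8 utilization threshold, and the router names are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class EgressReport:
    """One edge router's measurements for a destination (hypothetical fields)."""
    router: str
    utilization: float  # fraction of link capacity in use, 0.0 to 1.0
    latency_ms: float   # measured latency to the destination via this router


def choose_egress(reports, max_utilization=0.8):
    """Pick the lowest-latency router that is not overloaded.

    If every router is over the (assumed) utilization limit, fall back to
    the least-loaded one.
    """
    if not reports:
        raise ValueError("no reports to choose from")
    usable = [r for r in reports if r.utilization < max_utilization]
    if usable:
        return min(usable, key=lambda r: r.latency_ms).router
    return min(reports, key=lambda r: r.utilization).router


# Hypothetical data: the NYC peering router is faster but heavily loaded.
reports = [
    EgressReport("nyc-peering", utilization=0.95, latency_ms=4.0),
    EgressReport("nj-datacenter", utilization=0.40, latency_ms=6.5),
]
print(choose_egress(reports))
```

A single BGP speaker could not easily make this choice, because it only sees its own links; the point of the sketch is that the decision needs measurements from all edge routers in one place.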


"Central server" for routing. Uh oh.

AT&T used to try to avoid centralization, but ended up with routing controlled from Bedminster, NJ.[1] An interesting comment from AT&T's NOC tour guide is that load doesn't vary much any more. AT&T used to have holiday calling surges and such, but now, in an always-on world, overall load is relatively steady.

[1] http://fortune.com/att-global-network-operations-center/

You can easily employ redundancy for core services. Core functionality doesn't need to be synonymous with central point of failure. For example, you have three healthy copies of the core service running at all times, combined with failover. All software systems have a core function that must work or else the system fails, but that doesn't mean all software systems are centralized in any useful sense of the word.
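A minimal sketch of the "three copies plus failover" idea, assuming each replica is just a callable that raises `ConnectionError` when it is unreachable (the function names here are hypothetical, not from any real system):

```python
def call_with_failover(replicas, request):
    """Try each replica in order and return the first successful answer."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as exc:
            errors.append(exc)  # remember the failure and try the next copy
    raise RuntimeError(f"all {len(replicas)} replicas failed: {errors}")


def down(_request):
    """Stand-in for a replica that cannot be reached."""
    raise ConnectionError("replica unreachable")


def healthy(request):
    """Stand-in for a working replica."""
    return f"route for {request}"


# Two of the three copies are down; the call still succeeds.
print(call_with_failover([down, down, healthy], "10.0.0.0/8"))
```

The service is still "core" in the sense that callers depend on it, but no single copy is a point of failure.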
The problem I would see with centralization of routing is not reliability, but rather susceptibility to censorship. As the old saying goes, "The Net interprets censorship as damage and routes around it". While one could argue that root DNS servers are central-ish, they are hosted all over the world, so a single actor can hardly impose worldwide censorship. It is a known fact that malicious actors can manipulate BGP (see Hacking Team), but it is still better than putting all of our collective eggs in Google's basket...

Distributed things though, mesh networks, IPFS - that's what's giving me hope.

It's interesting to know that a traditional ISP like AT&T is heading the same way.

I think Google gets more freedom to try out some of these techniques because people still fundamentally think of them as a website (apart from Google Fiber, they don't serve end-users directly); whereas AT&T, being an ISP, is treated more like a water / power service, in that people expect them to be working by default, and going down is absolutely unacceptable.

AT&T, pre-Internet, had 10 major regions in the US, and switches had a fixed list of primary, secondary, and tertiary routes. The first "centralization" was simply that the priorities in the routing tables were changed every few minutes based on load. But if the central routing planner went down or was unreachable, everything still worked, just not as optimally.
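Roughly, that design could be sketched like this. The `Switch` class, the route names, and the expiry rule (a plan is ignored once it is older than `override_ttl` seconds) are my assumptions for illustration, not a description of AT&T's real system:

```python
import time


class Switch:
    """A switch with a fixed fallback route order and optional planner overrides."""

    def __init__(self, static_routes, override_ttl=300.0):
        self.static_routes = list(static_routes)  # primary, secondary, tertiary
        self.override_ttl = override_ttl          # seconds a plan stays valid
        self._override = None
        self._override_time = None

    def apply_plan(self, ordered_routes, now=None):
        """Called by the central planner to re-rank the routes by load."""
        now = time.monotonic() if now is None else now
        # Keep only routes the switch already knows, in the planner's order.
        known = [r for r in ordered_routes if r in self.static_routes]
        # Append any routes the planner left out so none are lost.
        known += [r for r in self.static_routes if r not in known]
        self._override = known
        self._override_time = now

    def route_order(self, now=None):
        """Return the planner's order if it is fresh, else the static order."""
        now = time.monotonic() if now is None else now
        fresh = (self._override is not None
                 and now - self._override_time <= self.override_ttl)
        return self._override if fresh else self.static_routes


sw = Switch(["via-chicago", "via-dallas", "via-denver"])
sw.apply_plan(["via-dallas", "via-chicago", "via-denver"], now=0.0)
print(sw.route_order(now=60.0))    # planner's order while the plan is fresh
print(sw.route_order(now=1000.0))  # static order once the plan has gone stale
```

The planner only ever reorders routes the switch already has, so losing the planner degrades the ordering but never leaves the switch without a route.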

What you don't want is software-defined networking where every new flow goes to Master Control for validation and routing. Some SDN systems do that, and they have a central point of failure and censorship.