by btilly 1965 days ago

To be fair, xml-rpc truly is terrible. To start with, XML is extremely verbose, so you pay a lot of network overhead plus a lot of parsing overhead, which makes RPCs significantly more expensive than they should be. Use https://capnproto.org/ instead. (It was written by the primary author of protobuf v2, and protobuf is an open-sourced version of what Google uses internally.) As far as I'm concerned, the only valid reason to use xml-rpc is that you're interfacing with someone else's system and they have chosen to use it.
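To put a rough number on the verbosity point, here is a small sketch that encodes the same call with Python's standard `xmlrpc.client` and, as a stand-in for a more compact encoding, with `json`. The method name and parameters are made up for illustration, and JSON is only a convenient stdlib baseline here, not Cap'n Proto. It measures size only, not parsing cost.

```python
import json
import xmlrpc.client

# Hypothetical call: method name and arguments are invented for illustration.
method = "get_orders"
params = ("user-42", 3, True)

# The XML-RPC request body as it would go over the wire.
xml_body = xmlrpc.client.dumps(params, methodname=method)

# The same information in a compact JSON envelope (an assumed format,
# used only as a baseline for comparison).
json_body = json.dumps({"method": method, "params": list(params)})

print(len(xml_body), "bytes as XML-RPC")
print(len(json_body), "bytes as JSON")
```

A schema-based binary format would typically be smaller again, since field names and type tags do not have to travel with every message.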

Moving on, here is an excellent example of an important piece of architecture that almost everyone gets wrong. In any distributed system you should transparently support having RPC calls carry an optional tracing_id that causes them to be traced. That means you log the details of the call under that tracing id, and every RPC emitted while handling that call carries the same id. You then have a small fraction of your starting requests set that id, and collect the logs afterwards in another system so that you can see, live, everything that happened in a traced RPC. To make this easy, you build it into the RPC library so that programmers don't even have to think about it.

Flagging only a small random fraction of RPCs at the source keeps the overhead of the system low. But now when there is a production problem that affects only a small percentage of RPCs, you just look for a recent traced RPC that shows the issue, look at the trace, and problems 3 layers deep are instantly findable.

Very few distributed systems do this. But those that do quickly discover that it is a critical piece of functionality. This is part of the secret sauce that lets Google debug production problems at scale. But basically nobody else has gotten the memo, and no standard library will do this.

Now I don't know why they reinvented xml-rpc for themselves. But if they had that specific feature in it, I am going to say that it wasn't a ridiculous thing to do. And the reason why not becomes obvious the first time you try to debug an intermittent problem in your service that happens because in some other service a few calls away there is an issue that happens 1% of the time based on some detail of the requests that your service is making.

1 comments

tguvot 1965 days ago

It happened 13 years ago and it was exactly the same but with different XML syntax :) It was way before microservices, etc. Purely point to point.

btilly 1964 days ago

> It was way before microservices, etc.

It was not way before microservices at Google. But it was before there was much general knowledge about them.

Sadly the internal lessons learned by Google have not seeped into the outside world. Here are examples.

1. What I just said about how to make requests traceable through the whole system without excess load.

2. Every service should keep statistics and respond at a standard URL. Build a monitoring system which has scrapes of that operational data as a major input.

3. Your alerting system should support rules like, "Don't fire alert X if alert Y is already firing." That is, if you're failing to save data, don't bother waking up the SRE for every service that is failing because you have a backend data problem. Send them an email for the morning, but don't page people who can't do anything useful anyway.

4. Every service should be stood up in multiple data centers with transparent failover. At Google the rule was n+2/n+1. Meaning that your service had to globally be in at least 2 more data centers than it needed for normal load, and in every region it had to be in at least one extra data center. With the result that if any data center goes out, no service should be interrupted, and if any 2 data centers go out the only externally visible consequence should be that requests might get slow.
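Points 3 and 4 are mechanical enough to sketch in a few lines. This is only an illustration under assumptions: the function names, the alert names, and the shape of the inputs are all invented, and capacity is counted in whole data centers.

```python
def route_alerts(firing, inhibited_by):
    """Split firing alerts into (page now, email for the morning).

    inhibited_by maps an alert name to the set of alerts that suppress it,
    i.e. "don't fire X if Y is already firing". Hypothetical format.
    """
    firing = set(firing)
    email_only = {
        alert for alert in firing
        if inhibited_by.get(alert, set()) & firing
    }
    return firing - email_only, email_only


def meets_n_plus_2(deployed_by_region, needed_by_region, needed_globally):
    """Check the n+2 global / n+1 per-region rule.

    All arguments count data centers: how many the service runs in per
    region, how many each region needs for normal load, and how many it
    needs for normal load worldwide.
    """
    if sum(deployed_by_region.values()) < needed_globally + 2:
        return False
    return all(
        deployed_by_region.get(region, 0) >= needed + 1
        for region, needed in needed_by_region.items()
    )


# Usage with made-up alert and region names.
rules = {"checkout_errors": {"storage_write_failures"}}
page, email = route_alerts({"checkout_errors", "storage_write_failures"}, rules)
print(sorted(page), sorted(email))
print(meets_n_plus_2({"us": 3, "eu": 2}, {"us": 2, "eu": 1}, needed_globally=3))
```

In the usage example only the storage alert pages, the dependent checkout alert is held for email, and the deployment passes because it has 5 data centers against a global need of 3 with one spare in each region.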

Now compare that to what people usually do with Docker and Kubernetes. I just have to shake my head. They're buzzword compliant but are failing to do any of what they need to do to make a distributed system operationally tractable. And then people wonder why their "scalable system" regularly falls over and nobody can fix it.