by chubot 4545 days ago
I'm not sure exactly what the grandparent comment meant, but I think I have an idea. I only skimmed the contents so take this with a grain of salt.

Your book is focusing on a pretty narrow part of distributed computing. I would rename it "Managing State in Distributed Systems", or "Distributed Storage Systems". Your examples are Bigtable and Dynamo, which fall in this category.

The book seems to be aimed at a "beginning" audience. But the topics are inappropriate for a beginning audience; they are skewed toward an expert audience.

Real distributed systems try to be stateless wherever possible. You need "big computer science" to manage state in distributed systems, but most code in a distributed system should not manage state. These techniques should be confined to specialized storage systems.

Here are some examples of real world distributed systems that don't use the described techniques to manage state:

  - clusters of stateless web servers + single master database (99%+ of websites people use)
  - message queue / work queue.  A single machine can productively manage 1,000 - 10,000 stateless workers, depending on the workload.
  - MapReduce
  - Original GFS
  - Napster
  - BitTorrent (tracker and trackerless would be interesting to write about)
  - Bitcoin

The title seems to imply a practical bent, but it seems more like a collection of ideas (which are important and interesting, but not really what engineers need to know. IMO the #1 skill for distributed computing is to be competent at BOTH programming a single computer and at system administration).

If I wanted to be harsh, I would say it looks like you read a bunch of stuff and didn't work with it or implement it? At the very least, the ideas don't seem to be put in the context of commonly deployed distributed systems.

People need to understand these simpler, more robust, and more performant techniques, and how to apply them to their specific problem domain, rather than blindly throwing consensus at every problem (which is a disturbing trend I've seen).
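
To make the message queue / work queue item in the list above concrete, here is a minimal sketch, assuming threads stand in for worker machines. The names `run_jobs`, `handler`, and `num_workers` are made up for illustration. The point is only that the workers keep nothing between jobs; the coordinator's queue and result map hold all the state.

```python
import queue
import threading


def run_jobs(jobs, handler, num_workers=4):
    """Fan jobs out to stateless workers; only the coordinator holds state."""
    jobs = list(jobs)
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = tasks.get()
            if item is None:  # sentinel: no more work for this worker
                return
            job_id, payload = item
            out = handler(payload)  # depends only on the payload
            with lock:
                results[job_id] = out

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for job_id, payload in enumerate(jobs):
        tasks.put((job_id, payload))
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return [results[i] for i in range(len(jobs))]
```

Because a worker holds no state, losing one only means re-queuing its job. This sketch leaves out retries and timeouts, which a real queue would need.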


It goes even beyond that. A lot of other very important, fundamental topics belong under the umbrella of distributed systems, starting with routing. The Internet is, after all, a giant distributed routing system.

Another topic that's huge all by itself is peer-to-peer networks, and all their associated aspects, such as structured (DHTs like Chord, Cassandra, etc.) vs unstructured (Gnutella, Kazaa, etc.), P2P search, handling churn, handling peers with heterogeneous capabilities, peer selection, topology organization, decentralized routing, file-sharing (torrents) vs streaming (PPLive, Spotify), etc.

Other topics (with several overlapping aspects) include:

- Security, such as Sybil attacks, group key management, etc;

- Overlay networks;

- CDNs;

- Ad hoc and mesh networks;

- MMOs and multiplayer games;

- SCADA and industrial control systems;

- Pub/Sub systems and application layer multicast;

- Distributed file systems;

- Load balancing and bandwidth management;

And that's just off the top of my head... I'm sure I'm missing other important topics.

> It goes even beyond that. A lot of other very important, fundamental topics belong under the umbrella of distributed systems, starting with routing. The Internet is, after all, a giant distributed routing system.

Yes! DNS is also fascinating as far as distributed databases and consistency go:

http://pages.cs.wisc.edu/~akella/CS740/S08/740-Papers/MD88.p...

Indeed, but ultimately covering all of those topics would require an incredible amount of time and effort. So I need to pick and choose my battles as some topics are more important or interesting to me than others. :)

Completely understand. But as chubot suggests, the topic of "Distributed Systems" is really broad, and something narrower in the title, such as "Distributed Data Systems", may be more apt.

> The title seems to imply a practical bent, but it seems more like a collection of ideas (which are important and interesting, but not really what engineers need to know. IMO the #1 skill for distributed computing is to be competent at BOTH programming a single computer and at system administration).

I think this is something where different authors will emphasize different aspects. My view is that an understanding of how to deal with the evolution of state within a system is crucial. Even systems that are not databases per se still depend on how state is managed, because you want to be able to reason about how some specific answer to a computation was derived and what guarantees it comes with (from strong consistency to some alternative but hopefully precise definition). I figure there will be disagreement on whether this is important, and that's fine. There are other books.

That does bring up an interesting question: which books on distributed systems do you feel exhibit your preferred approach (free or paid)?

Re: the suggested topics:

Clusters of stateless web servers + single master. This is definitely a common setup, but you need very little if any distributed systems research to implement it.

Queues: I find the larger scale implications of queuing to be rather interesting (specifically, how cascading failures can be caused by an inadequate understanding of interactions between queues) but haven't found a good discussion beyond Google's findings that doing duplicate work often pays off as reduced 95th percentile latency.
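
The "duplicate work" idea can be sketched roughly like this, assuming an asyncio client. The function name `hedged` and the `hedge_after` parameter are invented here, and this is not Google's implementation, just one simple way to issue a backup request when the first one is slow.

```python
import asyncio


async def hedged(make_call, hedge_after=0.05):
    """Start make_call(); if it is still running after hedge_after seconds,
    start a duplicate and return whichever attempt finishes first."""
    first = asyncio.create_task(make_call())
    done, _ = await asyncio.wait({first}, timeout=hedge_after)
    if done:
        return first.result()

    # The first attempt is slow: send a second, identical request.
    second = asyncio.create_task(make_call())
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # drop the loser
    return next(iter(done)).result()
```

This only makes sense when the call is safe to run twice, and it trades extra load for lower tail latency, which is exactly the kind of queue interaction that can backfire under overload.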

MapReduce: There are many good books covering this topic in much more depth and specificity, so I didn't feel like I had that much to add. MR does use the techniques described - beyond job assignment the whole system rests on the DFS which uses block-level replication and some coordination protocol to maintain metadata state.

I kind of assume people have had some exposure to the paradigm at this point and do address MapReduce a bit in the context of the CALM theorem, which notes that a much larger set of relational algebra operations can actually be executed safely without coordination. Another point might be that MapReduce is inefficient in that it provides too much fault tolerance for typical workloads and cluster sizes.
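
A toy illustration of the monotonicity idea behind CALM, as I understand it (this is my own sketch, not taken from the book; `merge_all` and `absent` are hypothetical names): merging batches with set union gives the same relation in any arrival order, so it needs no coordination, while a query that depends on something being absent can be invalidated by later input.

```python
from itertools import permutations


def merge_all(batches):
    """Fold batches of tuples into one relation using set union (monotone)."""
    relation = set()
    for batch in batches:
        relation |= batch
    return relation


def absent(relation, key):
    """Non-monotone query: true only while no tuple with this key has arrived."""
    return all(k != key for k, _ in relation)


batches = [{("a", 1)}, {("b", 2)}, {("a", 1), ("c", 3)}]

# Every delivery order yields the same final relation.
outcomes = {frozenset(merge_all(order)) for order in permutations(batches)}

# An early "absent" answer computed on partial input is later retracted.
early = absent(merge_all(batches[:1]), "c")
late = absent(merge_all(batches), "c")
```

Here `outcomes` contains a single relation, whereas `early` and `late` disagree, which is why the non-monotone query needs some form of coordination before its answer can be trusted.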

Original GFS: the design has been largely superseded both by newer versions of HDFS (e.g. eliminating the single point of failure in the initial design) and Google's (unpublished?) internal equivalents. BTW, the original GFS relies on Chubby, which uses Paxos internally.

Napster, BitTorrent and BitCoin: peer-to-peer systems definitely deserve a more extensive treatment in a later version of the book. The issues here are different in that trust, efficiency and resiliency are more important and I didn't have the bandwidth to handle them in the book as it stands.

Thanks for your comment, and I hope this doesn't sound like a rebuttal - I just wanted to think through the topics you mentioned one by one.

> Clusters of stateless web servers + single master. This is definitely a common setup, but you need very little if any distributed systems research to implement it.

First, define "stateless"? I would not characterize such a system as stateless at all. Even if you're not using sticky sessions (with cache servers/load balancers talking to each other for failover using fairly involved protocols), there's still ephemeral state (sessions) in your application server, as well as the bulk of the persistent state, which is provided with essentially "faith based consistency" (consider a typical memcached cluster with the client doing consistent hashing, asynchronously replicated MySQL with failover, etc. In case of a failure, neither availability nor consistency is guaranteed).
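
For readers who haven't seen client-side consistent hashing, here is a minimal sketch. The `HashRing` class, the `vnodes` parameter, and the cache node names are assumptions for illustration, and this is not the algorithm of any particular memcached client.

```python
import bisect
import hashlib


class HashRing:
    """Map keys to nodes so that removing a node only moves that node's keys."""

    def __init__(self, nodes, vnodes=100):
        # Each node gets several points on the ring to even out the load.
        # `nodes` must be non-empty.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(text):
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._ring[idx][1]


full = HashRing(["cache-a", "cache-b", "cache-c"])
after_failure = HashRing(["cache-a", "cache-b"])
```

Note what this does not give you: when `cache-c` dies, its keys silently land on another node with no data, and two clients with different views of the node list will disagree about where a key lives. That is the "faith based" part.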

On the level of protocol design, the whole idea of stateless protocols (REST) vs. stateful ones (sticky load balancers + SOAP, CORBA, RMI, etc.) is by itself a big distributed systems topic.

A web browser talking to a web server is by definition a distributed system. I am typing this as fast as I can, praying not to get the "your link is invalid" error from HN right now -- this is a real example of a distributed system and cache coherence/consistency/atomicity. Here's one paper that deals with just these sorts of questions (in the context of NFS): ftp://ftp.cs.berkeley.edu/ucb/sprite/papers/state.ps

> Re: Chubby and GFS

Original GFS doesn't rely on Chubby; BigTable does, however (for metadata). I believe newer versions (Colossus) by extension rely on Chubby as they rely on BigTable.

F1/Spanner, however, use consensus and transactions far more than the others and are very interesting in this sense.

[Edit: more elaboration on distributed systems issues in a "stateless" cluster of web servers].