Csep-Reading-8A

Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network

This paper takes us on a tour through five generations of Google's datacenter networks. The three main principles throughout are:

  • Clos topologies for better fault tolerance
  • commodity switches for cost efficiency
  • centralized control protocols: to manage the complexity of Clos topologies, the network elects a single, centralized point to gather and redistribute network state; each individual switch can then compute its forwarding tables from that "static" snapshot of the topology provided by the central point, rather than every node trying to track all of this dynamic state on its own (a rough sketch follows this list).
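
To make that last bullet concrete, here is a minimal Python sketch of the idea as I understand it: a master aggregates link-state reports and hands every switch the same snapshot, and each switch computes its forwarding table locally from that static view. The class names, the BFS route computation, and the toy topology are my own illustration, not the paper's actual Firepath implementation (which also covers master election, ECMP, and failure handling).

```python
from collections import deque

class Master:
    """Central point that aggregates link state and hands out snapshots."""
    def __init__(self):
        # switch name -> set of neighbors currently reported as up
        self.links = {}

    def report(self, switch, neighbor, up=True):
        nbrs = self.links.setdefault(switch, set())
        if up:
            nbrs.add(neighbor)
        else:
            nbrs.discard(neighbor)

    def snapshot(self):
        # One consistent view of the whole topology, pushed to every client.
        return {sw: frozenset(nbrs) for sw, nbrs in self.links.items()}

class SwitchClient:
    """A switch that computes next hops only from the master's snapshot."""
    def __init__(self, name):
        self.name = name
        self.next_hop = {}   # destination switch -> first hop

    def recompute(self, snapshot):
        # Single BFS from this switch over the static snapshot, recording
        # the first hop on a shortest path to every reachable destination.
        # (A real fabric would keep many equal-cost next hops for ECMP.)
        self.next_hop = {}
        seen = {self.name}
        queue = deque()
        for nbr in snapshot.get(self.name, ()):
            seen.add(nbr)
            self.next_hop[nbr] = nbr
            queue.append(nbr)
        while queue:
            node = queue.popleft()
            for nbr in snapshot.get(node, ()):
                if nbr not in seen:
                    seen.add(nbr)
                    self.next_hop[nbr] = self.next_hop[node]
                    queue.append(nbr)

# Toy two-stage wiring: two ToR switches, two spines.
master = Master()
for tor in ("tor1", "tor2"):
    for spine in ("spine1", "spine2"):
        master.report(tor, spine)
        master.report(spine, tor)

tor1 = SwitchClient("tor1")
tor1.recompute(master.snapshot())
print(tor1.next_hop)  # e.g. {'spine1': 'spine1', 'spine2': 'spine2', 'tor2': 'spine1'}
```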

I found the Outages section (6.2) most interesting. Throughout the paper it is clear that rigorous testing was employed across the datacenter network iterations, but of course it is impossible and/or infeasible to predict every scenario, since a network at this scale can't really be fully and accurately exercised in a hardware lab without prohibitive cost. For example, the failure to restart the entire fabric at once after a power outage: this kind of problem showing up "at runtime" isn't surprising, given that shutting down and powering up the entire datacenter, or some test version of it, would be infeasible at this scale. The postmortem resulted in stress tests in virtualized environments, which seems reasonable, but clearly this also won't catch everything. I would think TLA+ or some other form of formal verification is really the best answer at this scale, but I didn't see any mention of such techniques in the paper.