Google DeepMind’s Decoupled DiLoCo Trains LLMs Across Data Centers with Minimal Bandwidth

Google DeepMind has released a technical paper describing Decoupled DiLoCo (Distributed Low-Communication), a new approach to training large language models across geographically distributed data centers. The architecture is designed to address two persistent problems in large-scale AI training: the fragility of tightly synchronized compute clusters and the high bandwidth costs of coordinating chips across wide-area networks.

How It Works

Conventional training runs require thousands of chips to stay in near-constant synchronization, so any single hardware failure can stall the entire job. Decoupled DiLoCo instead breaks a training run into separate “islands” of compute, called learner units, that operate asynchronously. A failure within one island does not halt the others, and a failed unit is automatically reintegrated into the ongoing training run once it recovers.

The architecture builds on two prior Google research efforts. Pathways provided the underlying asynchronous data-flow infrastructure, while DiLoCo established techniques for reducing the inter-datacenter bandwidth needed to coordinate distributed training. Decoupled DiLoCo combines both.
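The island idea can be illustrated with a toy sketch. This is not the algorithm from the technical report: the quadratic loss, the plain gradient-descent inner steps, the simple averaging of updates, and every name and constant below are assumptions chosen for illustration. Rounds are also a simplification, since the system described is asynchronous. The sketch only shows that learners can work locally for many steps, exchange updates rarely, and drop out or rejoin without stopping the run.

```python
import random

# Hypothetical toy setup: every value here is an illustrative assumption.
TARGET = [3.0, -1.0]                     # optimum of the toy quadratic loss
SHARD_OFFSETS = [-0.2, -0.1, 0.1, 0.2]   # each learner sees slightly different data


def local_steps(params, target, steps, lr):
    """Run `steps` gradient-descent steps on 0.5 * (p - target)^2, with no communication."""
    p = list(params)
    for _ in range(steps):
        # The gradient of the toy loss with respect to p is (p - target).
        p = [x - lr * (x - t) for x, t in zip(p, target)]
    return p


def train(rounds=20, inner_steps=10, fail_prob=0.25, seed=0):
    """Average learner updates once per round, skipping learners that are down."""
    rng = random.Random(seed)
    global_params = [0.0, 0.0]
    for _ in range(rounds):
        deltas = []
        for offset in SHARD_OFFSETS:
            if rng.random() < fail_prob:
                continue  # this learner is down; the others keep going
            shard_target = [t + offset for t in TARGET]
            # A recovered learner simply restarts from the current global parameters.
            local = local_steps(global_params, shard_target, inner_steps, lr=0.05)
            deltas.append([l - g for l, g in zip(local, global_params)])
        if not deltas:
            continue  # nobody reported this round; keep the current parameters
        for i in range(len(global_params)):
            global_params[i] += sum(d[i] for d in deltas) / len(deltas)
    return global_params
```

Calling `train()` with the default failure probability should still land close to `TARGET`, while `train(fail_prob=1.0)` never leaves the starting point because no learner ever reports.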

Benchmark Results

Google DeepMind validated the approach by training a 12-billion-parameter model across four separate U.S. regions using 2 to 5 Gbps of wide-area network bandwidth, a level achievable with standard inter-datacenter internet connectivity rather than purpose-built private links. The team reports that the system completed training more than 20 times faster than conventional synchronization methods, because the required communication overlaps with longer compute periods instead of blocking forward progress.
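A rough back-of-the-envelope calculation shows why overlapping communication with compute matters at these link speeds. The figures below assume 2 bytes per parameter and a full, uncompressed exchange of all 12 billion parameters; neither assumption comes from the report, and the function names are hypothetical.

```python
def transfer_seconds(num_params, bytes_per_param, link_gbps):
    """Time to push one full copy of the parameters over a link of `link_gbps` Gbps."""
    bits = num_params * bytes_per_param * 8
    return bits / (link_gbps * 1e9)


def blocking_round_seconds(compute_seconds, comm_seconds):
    """Round time when compute waits for communication to finish."""
    return compute_seconds + comm_seconds


def overlapped_round_seconds(compute_seconds, comm_seconds):
    """Round time when communication runs in the background of compute."""
    return max(compute_seconds, comm_seconds)


# Assumed: 12e9 parameters at 2 bytes each (for example bf16), no compression.
slow_link = transfer_seconds(12e9, 2, 2)   # 96.0 seconds at 2 Gbps
fast_link = transfer_seconds(12e9, 2, 5)   # about 38.4 seconds at 5 Gbps
```

Under these assumptions a single exchange takes on the order of a minute, so it only stays off the critical path if each learner has at least that much local computation to do in the meantime.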

Real-world experiments using Gemma 4 models showed that models trained with Decoupled DiLoCo matched the benchmark performance of models trained with conventional methods. Chaos engineering tests, in which hardware failures were deliberately injected, confirmed that the system maintained high “goodput” (useful training throughput) while traditional approaches degraded sharply under the same failure conditions.
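The goodput contrast can be made concrete with a simplified model. The sketch below assumes time is divided into equal slots, that a tightly synchronized job makes progress only in slots where every learner is up, and that a decoupled job makes progress on every learner that is up. It ignores restart and checkpoint costs, and it is not the measurement method used in the report; the function names are hypothetical.

```python
def synchronous_goodput(up):
    """Fraction of time slots in which every learner is up at once."""
    slots = list(zip(*up))  # regroup per-learner rows into per-slot columns
    return sum(all(slot) for slot in slots) / len(slots)


def decoupled_goodput(up):
    """Fraction of learner-slots that are up, counted independently per learner."""
    total = sum(len(row) for row in up)
    return sum(sum(row) for row in up) / total
```

For example, with four learners over eight slots where each learner is down in one distinct slot, the synchronous figure is 0.5 while the decoupled figure is 0.875, because each outage costs only the affected learner's time.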

Mixed-Hardware Training

The architecture can also mix different hardware generations, such as TPU v6e and TPU v5p, within a single training run. According to the research, runs that combined chips from different generations, operating at different speeds, still matched the ML performance of runs using a single chip type. The team notes that this extends the useful life of older hardware and helps relieve capacity bottlenecks that arise when new hardware generations roll out unevenly across facilities.
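One way to see why decoupling helps with mixed hardware is to count how many local steps each learner completes in a fixed time window. The sketch below is an illustration only: the chip names and step times are hypothetical, not measured TPU figures, and the report may schedule work differently.

```python
def decoupled_steps(step_seconds, window_seconds):
    """Steps per learner when each learner runs at its own pace."""
    return {name: int(window_seconds // t) for name, t in step_seconds.items()}


def lockstep_steps(step_seconds, window_seconds):
    """Steps per learner when every step waits for the slowest learner."""
    slowest = max(step_seconds.values())
    steps = int(window_seconds // slowest)
    return {name: steps for name in step_seconds}


# Hypothetical per-step times in seconds; not real hardware measurements.
step_seconds = {"fast_chip": 1.0, "slow_chip": 1.5}
```

With a 60-second window, `decoupled_steps(step_seconds, 60)` gives the fast learner 60 steps and the slow learner 40, whereas `lockstep_steps(step_seconds, 60)` holds both to 40.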

Broader Implications

By reducing bandwidth requirements by orders of magnitude compared to conventional methods, Decoupled DiLoCo opens the possibility of using otherwise idle or stranded compute resources wherever they are located. The research positions this as a step toward training infrastructure that can scale across internet-connected facilities without requiring specialized inter-site networking, with fault tolerance built into the architecture rather than bolted on after the fact.

The full technical report has been published by Google DeepMind and Google Research.
