In this paper, Google presents its work on scaling giant language translation models to 600 billion parameters trained on 2048 TPU v3 accelerators. To cope with the challenges of training models at this scale, such as computation cost, programming effort, and efficient implementation on parallel devices, the authors developed GShard: a module composed of lightweight annotation APIs and an extension to the XLA compiler that makes it possible to express a wide range of parallel computation patterns with minimal changes to existing model code. Using GShard, they trained a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts layers scaling beyond 600 billion parameters, using automatic sharding, in just 4 days on 2048 TPU v3 accelerators, achieving far superior quality for translation from 100 languages into English compared to previous methods.
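The key idea behind the annotation API is that the model author only marks how a few tensors should be partitioned (for example, splitting Mixture-of-Experts weights along the expert dimension), and the XLA SPMD partitioner propagates shardings through the rest of the computation and inserts the necessary communication. GShard itself is exposed through TensorFlow/XLA and its exact API is not reproduced here; the sketch below instead uses JAX's sharding API, which follows the same annotate-and-compile pattern on top of the same XLA machinery. All names, shapes, and the toy einsum are illustrative assumptions, not the paper's code.

```python
# Minimal sketch of GShard-style "annotate tensors, let the compiler shard them",
# written with JAX's sharding API as a stand-in for GShard's TensorFlow/XLA APIs.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Build a 1-D logical mesh over whatever devices are available (e.g. TPU cores).
devices = np.array(jax.devices())
mesh = Mesh(devices, axis_names=("expert",))

# Toy Mixture-of-Experts weights: one feed-forward matrix per expert.
num_experts, d_model, d_ff = len(devices) * 4, 8, 32
rng = np.random.default_rng(0)
expert_w = jnp.asarray(rng.normal(size=(num_experts, d_model, d_ff)), jnp.float32)
tokens = jnp.asarray(rng.normal(size=(num_experts, 16, d_model)), jnp.float32)

def moe_ffn(tokens, expert_w):
    # Annotation: shard both tensors along the expert dimension of the mesh.
    # The compiler partitions the computation and propagates shardings from here.
    expert_sharding = NamedSharding(mesh, P("expert", None, None))
    expert_w = jax.lax.with_sharding_constraint(expert_w, expert_sharding)
    tokens = jax.lax.with_sharding_constraint(tokens, expert_sharding)
    # Each expert applies its own projection to the tokens routed to it.
    return jnp.einsum("egm,emf->egf", tokens, expert_w)

with mesh:
    out = jax.jit(moe_ffn)(tokens, expert_w)
print(out.shape)  # (num_experts, 16, d_ff)
```

The point of this pattern, as in the paper, is that the per-device kernels and cross-device communication are generated by the compiler; the change to model code amounts to a handful of sharding annotations.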