Mixture of Experts (MoE) is a deep learning technique that departs from the traditional practice of applying the same parameters to every input, instead selecting different parameters for each incoming example. The result is a sparsely activated model with an outrageous number of parameters but a constant computational cost. Despite several notable successes, MoE has not been widely adopted, held back by complexity, communication costs, and training instability. The Switch Transformer addresses these issues by simplifying the routing algorithm and designing intuitive models with reduced communication and computational costs. It also introduces improved training techniques that keep large sparse models stable when trained in lower-precision (bfloat16) formats.

Compared against T5-Base and T5-Large, the Switch Transformer achieves up to 7x faster pre-training with the same computational resources. These gains carry over to multilingual settings, where improvements over mT5-Base are measured across all 101 languages. Finally, language models with up to a trillion parameters were pre-trained on the "Colossal Clean Crawled Corpus", giving a 4x speedup over the T5-XXL model. Read more at https://arxiv.org/abs/2101.03961
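
To make the simplified routing concrete, here is a minimal sketch of top-1 ("switch") routing in NumPy: a small router picks a single expert feed-forward network per token and scales the expert's output by the router probability. All sizes and weight initializations here are made up for illustration, and the capacity limits and auxiliary load-balancing loss used in the actual model are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only (not the paper's configuration).
d_model, d_ff, num_experts, num_tokens = 16, 64, 4, 8

# One shared router matrix plus an independent feed-forward expert per slot.
router_w = rng.normal(scale=0.02, size=(d_model, num_experts))
experts = [
    (rng.normal(scale=0.02, size=(d_model, d_ff)),
     rng.normal(scale=0.02, size=(d_ff, d_model)))
    for _ in range(num_experts)
]

def switch_layer(tokens):
    """Route each token to its single highest-probability expert (top-1)."""
    logits = tokens @ router_w                      # [tokens, experts]
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)           # softmax over experts
    chosen = probs.argmax(-1)                       # one expert per token
    out = np.zeros_like(tokens)
    for i, e in enumerate(chosen):
        w_in, w_out = experts[e]
        h = np.maximum(tokens[i] @ w_in, 0.0)       # expert FFN with ReLU
        # Scale by the gate value so the router would receive gradients
        # during training.
        out[i] = probs[i, e] * (h @ w_out)
    return out

tokens = rng.normal(size=(num_tokens, d_model))
print(switch_layer(tokens).shape)  # (8, 16)
```

Because each token touches only one expert, adding more experts grows the parameter count without increasing the per-token compute, which is the source of the "many parameters, constant cost" property described above.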