GLaM is a mixture-of-experts (MoE) model: it is composed of multiple submodels (experts) that specialize in processing specific kinds of inputs. For each input token, a gating network selects two experts to process it. The full version of GLaM has 1.2T total parameters spread across 64 experts per MoE layer, with 32 such layers; however, only 97B parameters (8% of 1.2T) are activated per token during inference.
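The routing idea can be illustrated with a minimal top-2 gating sketch. This is not GLaM's actual implementation; the expert count, hidden size, and the use of plain linear layers as stand-ins for expert feed-forward networks are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 8   # toy scale; the full GLaM uses 64 experts per MoE layer
D_MODEL = 16      # hypothetical hidden size, for illustration only
TOP_K = 2         # GLaM's gating network routes each token to 2 experts

# Gating network: a linear projection from the token representation to expert logits.
W_gate = rng.normal(size=(D_MODEL, NUM_EXPERTS))

# Each "expert" here is a distinct linear map standing in for an expert FFN.
experts = [rng.normal(size=(D_MODEL, D_MODEL)) for _ in range(NUM_EXPERTS)]

def moe_layer(token: np.ndarray) -> np.ndarray:
    """Route one token through its top-2 experts and mix their outputs."""
    logits = token @ W_gate
    top2 = np.argsort(logits)[-TOP_K:]        # indices of the 2 highest-scoring experts
    gates = np.exp(logits[top2] - logits[top2].max())
    gates /= gates.sum()                      # softmax over the selected experts only
    # Only the chosen experts run, so per-token compute scales with k (here 2),
    # not with the total number of experts -- the source of the 8% activation figure.
    return sum(g * (token @ experts[i]) for g, i in zip(gates, top2))

out = moe_layer(rng.normal(size=D_MODEL))
print(out.shape)  # (16,)
```

Because only two of the eight expert matrices are multiplied per token, adding more experts grows the parameter count without growing per-token inference cost.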