Imagen is a revolutionary text-to-image model that produces incredibly realistic photos and has an advanced capability of comprehending natural language. It combines the power of transformer language models, which are trained on text datasets, with diffusion models in high-resolution image generation. To our surprise, generic large language models (e.g., T5) can encode texts for image synthesis remarkably well: compared to increasing the size of the image diffusion model, enhancing the size of the language model in Imagen leads to better sample quality and more accurate alignment between images and text.