Microsoft has developed a new text-to-speech model, VALL-E, that can preserve a speaker’s emotional tone and acoustic environment. It builds on EnCodec, a neural audio codec that Meta announced in October 2022. Unlike other text-to-speech methods that manipulate waveforms to create speech, VALL-E generates discrete audio codec codes from text and acoustic prompts. It uses EnCodec to break a sample of a person’s voice into tokens, then draws on its training data to match what it “knows” about how that voice would sound speaking other phrases. The technology carries potential risks, such as spoofing voice identification or impersonating a specific speaker. To mitigate them, Microsoft says that detection models can be built to tell whether an audio clip was synthesized, and that it will apply its AI Principles as it develops VALL-E further.
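
To make the idea of turning sound into discrete tokens more concrete, here is a minimal sketch of residual quantization, the general family of techniques that neural codecs such as EnCodec rely on. It is not EnCodec or VALL-E code: the real codec uses a learned neural encoder and learned vector codebooks, whereas this toy quantizes single samples against small, evenly spaced codebooks. The function names, codebook sizes, and the sine-wave "voice" are all assumptions made for illustration.

```python
import math


def uniform_codebook(limit, levels):
    """Evenly spaced values in [-limit, limit]; a hypothetical stand-in for a learned codebook."""
    return [-limit + 2 * limit * i / (levels - 1) for i in range(levels)]


def nearest(value, codebook):
    """Index of the codebook entry closest to value."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - value))


def rvq_encode(samples, codebooks):
    """Turn each sample into one token per codebook; each stage quantizes what the previous stage missed."""
    tokens = []
    for sample in samples:
        residual = sample
        codes = []
        for codebook in codebooks:
            index = nearest(residual, codebook)
            codes.append(index)
            residual -= codebook[index]
        tokens.append(codes)
    return tokens


def rvq_decode(tokens, codebooks):
    """Rebuild each sample by summing the codebook entries its tokens point to."""
    return [sum(cb[i] for cb, i in zip(codebooks, codes)) for codes in tokens]


# A short 440 Hz sine sampled at 8 kHz stands in for a recorded voice prompt (assumption).
samples = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(16)]

# Stage 1 is coarse; stage 2 covers the largest error stage 1 can leave behind (1/7).
codebooks = [uniform_codebook(1.0, 8), uniform_codebook(1.0 / 7, 8)]

tokens = rvq_encode(samples, codebooks)
reconstructed = rvq_decode(tokens, codebooks)
max_error = max(abs(a - b) for a, b in zip(samples, reconstructed))
```

Each entry of `tokens` is a pair of small integers, and decoding them recovers the waveform approximately. In a system like the one described above, a language model would predict token sequences of this kind from the text and from the tokens of the voice prompt, and a codec decoder would turn them back into audio.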