Microsoft reveals VALL-E 2 AI, achieving human-like speech
Tests on datasets like LibriSpeech and VCTK show that VALL-E 2’s voice quality matches or exceeds that of human speech.
Microsoft has made a significant leap forward in AI speech generation with its VALL-E 2 text-to-speech (TTS) system. According to Microsoft’s researchers, VALL-E 2 achieves human parity, meaning its synthesized voices are rated on par with recordings of real people on the benchmarks tested. The system needs only a few seconds of audio to learn and mimic a speaker’s voice.
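Tests on speech datasets like LibriSpeech and VCTK showed that VALL-E 2’s voice quality matches or even surpasses human quality. Two techniques underpin the result. ‘Repetition Aware Sampling’ keeps generation stable by tracking how often the same audio token has recently been produced, and ‘Grouped Code Modeling’ processes audio codes in groups, which shortens the sequences the model has to handle. Together they allow the system to handle complex sentences and repetitive phrases naturally, ensuring smooth and realistic speech output.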
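For readers curious about what repetition-aware decoding might look like in practice, here is a minimal Python sketch. It is not Microsoft’s code. It assumes the general idea that a token is first drawn with nucleus (top-p) sampling, and that if the same token already dominates a recent window of the output, the sampler falls back to drawing from the full probability distribution. The function names, the window size and the threshold are all hypothetical values chosen for illustration.

```python
import random
from typing import List, Optional, Sequence


def nucleus_sample(probs: Sequence[float], top_p: float, rng: random.Random) -> int:
    """Draw a token index from the smallest set of tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: List[int] = []
    total = 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(
    probs: Sequence[float],
    history: Sequence[int],
    top_p: float = 0.8,       # hypothetical default
    window: int = 10,         # hypothetical default, must be positive
    threshold: float = 0.1,   # hypothetical default
    rng: Optional[random.Random] = None,
) -> int:
    """Nucleus sampling with a fallback when the drawn token repeats too often."""
    if rng is None:
        rng = random.Random()
    token = nucleus_sample(probs, top_p, rng)
    recent = list(history[-window:])
    if recent:
        # Share of the recent window already occupied by the candidate token.
        ratio = recent.count(token) / len(recent)
        if ratio > threshold:
            # Too repetitive: resample from the full, untruncated distribution.
            token = rng.choices(range(len(probs)), weights=list(probs), k=1)[0]
    return token
```

In this sketch, calling `repetition_aware_sample(probs, history)` once per decoding step and appending the result to `history` gives the sampler a chance to break out of a loop in which one token keeps being chosen.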
Although it has published audio samples, Microsoft considers VALL-E 2 too risky to release publicly because of potential misuse such as voice spoofing. This cautious approach aligns with the wider industry’s concerns, as seen with OpenAI’s restrictions on its own voice technology.
While VALL-E 2 represents a significant breakthrough, it remains a research project for now. The development of AI continues apace, with companies striving to balance innovation with ethical considerations.