VibeVoice: Open-Source Long-Form, Multi-Speaker TTS
2025-09-03
VibeVoice is a novel open-source framework for generating expressive, long-form, multi-speaker conversational audio, such as podcasts, from text. It targets weaknesses of traditional TTS systems: poor scalability to long inputs, inconsistent speaker identity, and unnatural turn-taking. Its key innovation is a pair of continuous speech tokenizers (acoustic and semantic) that run at an ultra-low frame rate of 7.5 Hz, preserving audio fidelity while keeping long sequences computationally manageable. Generation follows a next-token diffusion framework, in which an LLM models the textual and dialogue context and a diffusion head produces high-fidelity audio. VibeVoice can synthesize up to 90 minutes of speech with up to 4 distinct speakers, exceeding the limits of many existing models.
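To give a rough sense of why the 7.5 Hz frame rate matters for long-form audio, the short sketch below counts how many tokenizer frames a 90-minute session would need. The 7.5 Hz and 90-minute figures come from the summary above; the 50 Hz comparison rate and the helper name `frames_for` are assumptions chosen purely for illustration, and the count ignores text tokens and any other sequence overhead.

```python
def frames_for(duration_minutes: float, frame_rate_hz: float) -> int:
    """Number of tokenizer frames needed to cover the given duration."""
    return round(duration_minutes * 60 * frame_rate_hz)


VIBEVOICE_HZ = 7.5   # frame rate stated in the summary
BASELINE_HZ = 50.0   # hypothetical comparison rate (assumption, not from the source)
MAX_MINUTES = 90     # maximum session length stated in the summary

vibevoice_frames = frames_for(MAX_MINUTES, VIBEVOICE_HZ)
baseline_frames = frames_for(MAX_MINUTES, BASELINE_HZ)

print(vibevoice_frames)  # 40500
print(baseline_frames)   # 270000
```

Under these assumptions, a full 90-minute session comes to about 40,500 frames, a length that is plausible for an LLM context, whereas the hypothetical 50 Hz rate would need roughly 6.7 times as many.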
AI