VibeVoice: Open-Source Long-Form, Multi-Speaker TTS

2025-09-03

VibeVoice is an open-source framework that generates expressive, long-form, multi-speaker conversational audio, such as podcasts, from text. It targets problems that traditional TTS systems struggle with: scalability, speaker consistency, and natural turn-taking. Its key innovation is a pair of continuous speech tokenizers (acoustic and semantic) that run at an ultra-low frame rate of 7.5 Hz, preserving audio fidelity while improving efficiency on long sequences. Generation follows a next-token diffusion framework: an LLM handles context understanding, and a diffusion head produces high-fidelity audio. VibeVoice can synthesize up to 90 minutes of speech with up to 4 distinct speakers, exceeding the limits of many existing models.

(microsoft.github.io)
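
To make the efficiency claim concrete, the arithmetic behind the 7.5 Hz frame rate can be sketched as follows. The 7.5 Hz and 90-minute figures come from the summary above; the 50 Hz comparison rate is purely an illustrative assumption and is not attributed to any particular model, and the helper name `frames_for` is hypothetical. The sketch also assumes one frame corresponds to one sequence position, which is a simplification.

```python
# Figures taken from the summary above.
VIBEVOICE_FRAME_RATE_HZ = 7.5
MAX_MINUTES = 90

# Illustrative assumption only: a higher-rate tokenizer for comparison.
ASSUMED_BASELINE_FRAME_RATE_HZ = 50.0


def frames_for(minutes: float, frame_rate_hz: float) -> int:
    """Return the number of tokenizer frames needed for `minutes` of audio."""
    seconds = minutes * 60
    return round(seconds * frame_rate_hz)


vibevoice_frames = frames_for(MAX_MINUTES, VIBEVOICE_FRAME_RATE_HZ)
baseline_frames = frames_for(MAX_MINUTES, ASSUMED_BASELINE_FRAME_RATE_HZ)

print(f"7.5 Hz tokenizer, 90 min: {vibevoice_frames} frames")
print(f"50 Hz tokenizer (assumed), 90 min: {baseline_frames} frames")
print(f"Sequence length ratio: {baseline_frames / vibevoice_frames:.2f}x")
```

Under these assumptions, 90 minutes of audio takes 40,500 frames at 7.5 Hz versus 270,000 at the assumed 50 Hz rate, which is why a low frame rate matters for fitting long conversations into a single sequence.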
