Dia: A 1.6B Parameter Text-to-Speech Model from Nari Labs

2025-04-21
Nari Labs has released Dia, a 1.6B-parameter text-to-speech model that generates highly realistic dialogue directly from a transcript. Conditioning the output on an audio sample lets users control emotion and tone, and the model can also produce nonverbal cues such as laughter and coughs. To accelerate research, the pretrained model checkpoints and inference code are available on Hugging Face, and a demo page compares Dia with ElevenLabs Studio and Sesame CSM-1B. Dia currently requires a GPU with about 10 GB of VRAM (CPU support is planned), and it generates roughly 40 tokens per second on an A4000 GPU. A quantized version is planned to reduce memory use. The model is released under the Apache License 2.0, and Nari Labs strictly prohibits misuse such as identity theft, generating deceptive content, or illegal activities.
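To put the parameter count and throughput figures in context, here is a rough back-of-the-envelope sketch in Python. It estimates the memory taken by the weights alone at a few numeric precisions, and how long a clip would take to generate at 40 tokens per second. The helper names are hypothetical, and the value of 86 tokens per second of audio is an assumption that this summary does not state; substitute the real figure for the model's audio codec. The weights-only estimate will not match the reported 10 GB, since activations, caches, and framework overhead also use VRAM.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


def generation_seconds(audio_seconds: float,
                       tokens_per_audio_second: float,
                       tokens_per_second: float) -> float:
    """Wall-clock time to generate a clip of the given length."""
    return audio_seconds * tokens_per_audio_second / tokens_per_second


N_PARAMS = 1.6e9                 # parameter count from the announcement
TOKENS_PER_SECOND = 40.0         # reported throughput on an A4000
TOKENS_PER_AUDIO_SECOND = 86.0   # ASSUMPTION: not stated in this summary

# Weights-only footprint at common precisions.
for name, nbytes in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{name}: {weight_memory_gb(N_PARAMS, nbytes):.1f} GB")

# Time to generate 10 seconds of audio under the assumed token rate.
t = generation_seconds(10.0, TOKENS_PER_AUDIO_SECOND, TOKENS_PER_SECOND)
print(f"10 s of audio takes about {t:.1f} s to generate")
```

Under these assumptions, generation on an A4000 would be slower than real time, and halving the bytes per parameter halves the weight footprint, which is the kind of saving a quantized release would target.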
