INFP: An Audio-Driven Interactive Head Generation Framework for Natural Dyadic Conversations

2024-12-22

ByteDance introduces INFP, an audio-driven interactive head generation framework. Given dual-track audio from a dyadic (two-person) conversation and a single portrait image, INFP synthesizes realistic video of an agent that shows verbal, nonverbal, and interactive cues, including lifelike facial expressions and head movements. The framework is lightweight, which makes it suited to real-time communication scenarios such as video conferencing. INFP works in two stages: Motion-Based Head Imitation and Audio-Guided Motion Generation. The first stage projects facial communicative behaviors into a low-dimensional motion latent space. The second stage maps the dyadic audio to codes in that latent space, which is what makes generation audio-driven. The authors also introduce DyConv, a new large-scale dataset of dyadic conversations. They report that INFP achieves superior performance and natural interaction.
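The two-stage data flow can be illustrated with a toy sketch. This is not INFP's implementation: the class names, the feature dimensions, and the random linear projections below are all assumptions standing in for learned networks, and the rendering step that would turn motion codes plus the portrait image into video frames is left out. The sketch only shows which inputs each stage consumes and that both stages meet in the same latent space.

```python
import random
from typing import List

Vector = List[float]

# All sizes are hypothetical; the source does not state them.
FACE_DIM = 32    # per-frame facial behavior features (stage 1 input)
AUDIO_DIM = 16   # per-frame audio features for ONE speaker track
LATENT_DIM = 8   # shared low-dimensional motion latent


def make_matrix(rows: int, cols: int, seed: int) -> List[Vector]:
    """Random weights standing in for a trained network."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix: List[Vector], vec: Vector) -> Vector:
    """Plain matrix-vector product with a shape check."""
    if any(len(row) != len(vec) for row in matrix):
        raise ValueError("input size does not match the projection")
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class MotionEncoder:
    """Stage 1 stand-in: facial behavior features -> motion latent code."""

    def __init__(self, seed: int = 0) -> None:
        self.weights = make_matrix(LATENT_DIM, FACE_DIM, seed)

    def encode(self, face_features: Vector) -> Vector:
        return matvec(self.weights, face_features)


class AudioToMotion:
    """Stage 2 stand-in: both audio tracks -> code in the same latent space."""

    def __init__(self, seed: int = 1) -> None:
        self.weights = make_matrix(LATENT_DIM, 2 * AUDIO_DIM, seed)

    def predict(self, agent_audio: Vector, partner_audio: Vector) -> Vector:
        # Concatenate the two tracks so the output depends on both speakers.
        return matvec(self.weights, agent_audio + partner_audio)


def generate_motion_codes(
    agent_track: List[Vector],
    partner_track: List[Vector],
    model: AudioToMotion,
) -> List[Vector]:
    """Produce one motion latent code per frame of dual-track audio."""
    if len(agent_track) != len(partner_track):
        raise ValueError("both audio tracks must have the same number of frames")
    return [model.predict(a, p) for a, p in zip(agent_track, partner_track)]
```

In this sketch `MotionEncoder` would be used when learning the latent space from video, while `generate_motion_codes` is the inference path that needs only the two audio tracks. Feeding both tracks into every prediction is one simple way to let the agent's motion react to the conversation partner as well as to its own speech.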

AI