One-Minute Videos from Text Storyboards using Test-Time Training Transformers

2025-04-08

Current Transformer models struggle to generate one-minute videos because self-attention layers are inefficient over long contexts. This paper explores Test-Time Training (TTT) layers, whose hidden states are themselves neural networks, offering greater expressiveness than fixed-size states. Adding TTT layers to a pre-trained Transformer enables the generation of one-minute videos from text storyboards. In experiments on a Tom and Jerry cartoon dataset, TTT layers produce markedly more coherent videos and stronger storytelling than baselines such as Mamba 2 and Gated DeltaNet, achieving a 34 Elo point advantage in human evaluation. While artifacts remain, likely due to the limited capacity of the 5B-parameter model, this work demonstrates a promising approach that could scale to longer videos and more complex narratives.
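The core idea of a TTT layer can be illustrated with a toy sketch: instead of caching keys and values, the layer's hidden state is the weight matrix of a small inner model that is updated by one gradient step per token on a self-supervised reconstruction loss. The sketch below is a deliberately simplified illustration under assumed choices (a linear inner model, an identity-reconstruction loss, a fixed learning rate); the actual TTT layers in the paper use learned input/output projections and richer inner models.

```python
import numpy as np

def ttt_linear_layer(tokens, lr=0.1):
    """Toy sketch of a Test-Time Training (TTT) layer.

    The hidden state is itself a model: a linear map W, updated by one
    gradient step per token on a self-supervised loss l(W; x) = ||Wx - x||^2
    (a simplified stand-in for the paper's learned reconstruction task).
    """
    d = tokens.shape[1]
    W = np.zeros((d, d))  # hidden state = weights of the inner model
    outputs = []
    for x in tokens:
        pred = W @ x
        grad = 2.0 * np.outer(pred - x, x)  # dl/dW for the squared loss
        W = W - lr * grad                   # "train" the state at test time
        outputs.append(W @ x)               # emit output with updated state
    return np.stack(outputs), W
```

Because the state is trained rather than merely appended to, its size stays constant with sequence length while its contents adapt to the context; feeding the same token repeatedly, for example, drives the inner model's reconstruction error down step by step.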