- Synchronized single-pass audio + video generation
- Native 4K output at up to 50 FPS
- Multimodal conditioning: text, image, video, depth, keyframes
- 19B-parameter DiT architecture (14B video + 5B audio)
- Apache 2.0 open-source license with model weights and code
- Text-to-Video and Image-to-Video generation modes
- Low-precision inference optimizations (NVFP4/NVFP8)
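
To give a rough sense of what the parameter counts and low-precision formats above mean for memory, here is a back-of-envelope sketch of weight storage. The bit widths are nominal assumptions (16, 8, and 4 bits per weight). The estimate ignores quantization scale factors, activations, and any layers kept in higher precision, so real usage will be higher. The names `PARAMS`, `BITS_PER_WEIGHT`, and `weight_gib` are illustrative and not part of the project's code.

```python
# Back-of-envelope weight-storage estimate (illustrative only).
# Ignores quantization scale factors, activations, and any
# components that stay in higher precision.

# Parameter counts taken from the feature list above.
PARAMS = {"video": 14e9, "audio": 5e9}

# Nominal bits per weight for each precision (assumption).
BITS_PER_WEIGHT = {"bf16": 16, "8-bit": 8, "4-bit": 4}


def weight_gib(num_params: float, bits: int) -> float:
    """Return storage in GiB for num_params weights at `bits` bits each."""
    return num_params * bits / 8 / 2**30


total_params = sum(PARAMS.values())

for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: {weight_gib(total_params, bits):.1f} GiB")
```

Halving the bit width halves the weight footprint, which is the main reason 4-bit and 8-bit formats make a model of this size more practical on a single GPU.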