- Text-to-video, image-to-video, and reference-to-video generation
- First-frame and last-frame control
- Structured 9-grid image input for image-to-video
- Subject and voice references, plus up to 5 video references
- Instruction-based natural language editing and video recreation
- Native audio sync and lip-sync-aware audio generation
- 2–15 second duration, 1080p output
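
The numeric limits in the list above (2–15 second duration, at most 5 video references) could be checked client-side before submitting a job. The following is a minimal sketch only: `GenerationRequest`, `validate`, and every field name are hypothetical and are not taken from any published API for this model.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Limits taken from the feature list above.
MIN_DURATION_S = 2
MAX_DURATION_S = 15
MAX_VIDEO_REFERENCES = 5


@dataclass
class GenerationRequest:
    """Hypothetical request shape; field names are assumptions for illustration."""
    prompt: str
    duration_s: int = 5
    first_frame: Optional[str] = None   # path or URL of the first-frame image
    last_frame: Optional[str] = None    # path or URL of the last-frame image
    video_references: List[str] = field(default_factory=list)


def validate(req: GenerationRequest) -> List[str]:
    """Return a list of human-readable problems; an empty list means the request passes."""
    errors: List[str] = []
    if not (MIN_DURATION_S <= req.duration_s <= MAX_DURATION_S):
        errors.append(
            f"duration_s must be between {MIN_DURATION_S} and {MAX_DURATION_S} seconds"
        )
    if len(req.video_references) > MAX_VIDEO_REFERENCES:
        errors.append(f"at most {MAX_VIDEO_REFERENCES} video references are allowed")
    return errors
```

Returning a list of problems rather than raising on the first one lets a caller report every violated limit at once.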