Abstract
Music-to-video generation at song length remains unsolved because current systems fail to align visuals with long-range musical structure, enforce character consistency, or reason about lyrical intent. AutoMV is the first automatic, fully open multi-agent system that takes raw audio and time-stamped lyrics as input and outputs an entire music video without manual curation.
The pipeline begins with music information retrieval that extracts beats, sections, stems, and aligned lyrics. Dedicated Screenwriter and Director agents—powered by Gemini models—co-author a scene-by-scene script, create character bibles, and issue camera instructions. Specialized generation backends produce both "story" and "performance" shots, while a Verifier agent enforces factual alignment and iteratively requests revisions to maintain temporal coherence, lip sync, and visual quality.
To evaluate this long-form task, we release AutoMV-Bench, a benchmark of 60 songs scored by expert raters across four high-level dimensions and twelve granular criteria. We further design LLM-Score, an LLM-based judge that correlates strongly with human ratings and enables scalable assessment. AutoMV substantially outperforms commercial baselines on every category, narrowing the gap to human-directed productions.
Key Contributions
End-to-End Multi-Agent Pipeline
AutoMV connects MIR parsing, screenwriting, directing, generation, and verification agents so that a full music video can be produced directly from audio and lyrics.
Music-Aware Planning
Our agents leverage beat, structure, and lyric cues to design camera moves, shot types, and character profiles that stay synchronized with the soundtrack.
Iterative Verifier
A Gemini-based Verifier agent automatically checks alignment, physical feasibility, and continuity, requesting reshoots until the generated shots satisfy the script.
AutoMV-Bench & LLM-Score
We release the first benchmark for music-to-video quality along with LLM-Score, an automatic judge that approximates expert evaluation across twelve criteria.
System Overview
AutoMV starts from raw audio, executes music information retrieval to obtain beat grids, structure, and lyric timestamps, and dispatches the data to collaborating agents. The Screenwriter agent produces a storyboard and dialogue-level prompts, while the Director agent assigns shot lists, character poses, and camera directives. Their outputs populate a persistent character bank and scene graph used to query multiple video backends—including diffusion-based generators for story segments and talking/singing avatar models for performances.
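For illustration, the character bank and scene graph can be pictured as simple typed records shared by all agents. The dataclasses below are a minimal sketch of such a store; the field names are hypothetical, not AutoMV's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class CharacterProfile:
    """One entry in the persistent character bank (hypothetical schema)."""
    name: str
    appearance: str       # wardrobe, hair, build; reused verbatim in every prompt
    reference_image: str  # path to an anchor image used for identity consistency

@dataclass
class Shot:
    """One node in the scene graph, authored by the Director agent."""
    scene_id: int
    start: float          # seconds, snapped to the MIR beat grid
    end: float
    shot_type: str        # "story" or "performance"
    camera: str           # camera directive, e.g. "slow dolly-in, low angle"
    characters: list[str] = field(default_factory=list)  # keys into the character bank
    prompt: str = ""      # assembled from the script and the character bible
```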
Generated clips loop through a Verifier agent that critiques synchronization, continuity, and cinematic quality. Failed shots are sent back for targeted regeneration, and the surviving clips are automatically edited into a cohesive music video with transitions and subtitles aligned to the lyrics.
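The reshoot loop itself is plain control flow. Below is a minimal sketch, with `generate_clip` and `verify_clip` as injected stand-ins for the generation backends and the Gemini-based Verifier (a hypothetical API, not the system's real interfaces):

```python
def produce_shot(shot, generate_clip, verify_clip, max_retries=3):
    """Generate one shot, then loop through verification until it passes.

    generate_clip(shot, feedback) -> clip: stand-in for the diffusion or
    avatar backend chosen by shot type. verify_clip(clip, shot) ->
    (passed, notes): stand-in for the Gemini-based critic that checks
    synchronization, continuity, and cinematic quality.
    """
    clip, feedback = None, None
    for _ in range(max_retries):
        clip = generate_clip(shot, feedback)        # feedback steers targeted regeneration
        passed, feedback = verify_clip(clip, shot)  # critique of failed aspects, if any
        if passed:
            return clip
    return clip  # keep the last attempt once the retry budget is exhausted
```

In the real system the loop continues until the Verifier signs off; the `max_retries` cap here is just a safeguard for the sketch.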
- MIR Module: Beat tracking, vocal separation, section segmentation, and lyric alignment (see the beat-grid sketch after this list).
- Creative Agents: Screenwriter & Director maintain shared context for characters, scenes, and camera plans.
- Generation Hub: Story-image diffusion, singing avatar, and choreography models coordinated per shot type.
- Verifier Loop: Gemini-based critic ensures audiovisual alignment and requests reshoots when necessary.
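The exact MIR stack is not specified here, but the beat grid that downstream agents snap shots to can be extracted with off-the-shelf tools such as librosa; stem separation, section segmentation, and lyric alignment would plug in alongside this step.

```python
import librosa

def extract_beat_grid(audio_path: str):
    """Return (tempo_bpm, beat_times_sec): the grid that shots are aligned to."""
    y, sr = librosa.load(audio_path, sr=None)  # keep the native sample rate
    tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)
    return float(tempo), beat_times

tempo, beats = extract_beat_grid("song.wav")  # "song.wav" is a placeholder path
print(f"{tempo:.1f} BPM; first beats at {beats[:4]} s")
```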
Production Efficiency
AutoMV collapses the traditional, labor-heavy music video pipeline into a compact agentic workflow. The chart contrasts a human crew—spanning scriptwriters, directors, actors, editors, and moderators—with AutoMV's MIR, VLM, and generation modules.
- 120 hours of manual coordination shrink to 0.5 hours of automated processing.
- Budgets drop from $10k to roughly $15 in compute.
- Quality reaches 2.4/5 with no manual intervention, approaching the 2.9/5 human-directed baseline and enabling rapid iteration.
These savings let artists and studios prototype full music videos at a fraction of today's cost while keeping creative control over prompts and lyrical direction.
Quality Against Baselines
AutoMV outperforms commercial systems such as Revid.ai-base and OpenArt-story across every AutoMV-Bench dimension, approaching expert-directed productions. The table below summarizes cost, generation time, beat alignment (IB), the four Music Content sub-metrics (TeG, PoG, CoG, ArG), and Human Study scores.
To ensure that automatic metrics mirror expert judgement, we correlate LLM-Score with human annotations. Gemini 3.5 Pro-Preview delivers the strongest alignment—reaching up to 0.74 on performance artistry—suggesting that our benchmark faithfully reflects human preferences across the twelve criteria.
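As a sketch of how such agreement can be checked, rank correlation between paired per-song scores is one standard choice (the numbers below are illustrative placeholders, not AutoMV-Bench data):

```python
from scipy.stats import spearmanr

# Placeholder scores for one criterion; not actual benchmark data.
human_scores = [2.5, 3.0, 4.0, 1.5, 3.5]
llm_scores = [2.0, 3.5, 4.5, 1.0, 3.0]

rho, p_value = spearmanr(human_scores, llm_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```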
Quality Demos
Lazy Song
灰色 ("Gray")
Golden Hour
AutoMV achieves the highest IB score (24.4%) while maintaining competitive TeG, PoG, CoG, and ArG ratings. Ablations confirm that lyrics grounding, the character bank, and the verifier agent each contribute measurable gains, pushing AutoMV to a 2.42 expert score—closing most of the gap to human-directed references.
Results at a Glance
- Beat alignment: +9.7-point improvement over Pika Video in the expert study.
- Lyric faithfulness: 86% of AutoMV shots accurately depict lyrical content vs. 41% for baselines.
- Continuity: Verifier-driven reshoots cut character drift errors by 62%.
- User study: 78% of raters prefer AutoMV over commercial alternatives for storytelling depth.
Qualitatively, AutoMV maintains protagonists and wardrobes across long scenes, integrates choreography cues with percussive beats, and cuts transitions according to the detected song structure. Please explore additional examples in the video carousel above and in the project repository.
More Video Examples
"Believer" — Excellent Character Consistency and Storytelling.
"APT." — Diverse visuals with excellent audio-visual matching.
"灰色" — Compatible with multiple styles and languages.
"Lazy Song" — Excellent storytelling and diverse settings.
BibTeX
@misc{tang2025automvautomaticmultiagentmusic,
  title={AutoMV: An Automatic Multi-Agent System for Music Video Generation},
  author={Xiaoxuan Tang and Xinping Lei and Chaoran Zhu and Shiyun Chen and Ruibin Yuan and Yizhi Li and Changjae Oh and Ge Zhang and Wenhao Huang and Emmanouil Benetos and Yang Liu and Jiaheng Liu and Yinghao Ma},
  year={2025},
  eprint={2512.12196},
  archivePrefix={arXiv},
  primaryClass={cs.MM},
  url={https://arxiv.org/abs/2512.12196},
}