AutoMV

An Automatic Multi-Agent System for Music Video Generation


Transforming full songs into coherent, beat-aligned music videos through collaborative agents and MIR-driven planning.

BUPT · Nanjing University · Queen Mary University of London · HKUST · University of Manchester
*Equal Contribution · Corresponding Authors

AutoMV orchestrates music understanding, scriptwriting, directing, and verification agents to deliver full-length, human-centric music videos that stay faithful to rhythm, structure, and lyrical semantics—without human prompting.

Abstract

Music-to-video generation at song length remains unsolved because current systems fail to align visuals with long-range musical structure, enforce character consistency, or reason about lyrical intent. AutoMV is the first automatic, fully open multi-agent system that takes raw audio and time-stamped lyrics as input and outputs an entire music video without manual curation.

The pipeline begins with music information retrieval that extracts beats, sections, stems, and aligned lyrics. Dedicated Screenwriter and Director agents—powered by Gemini models—co-author a scene-by-scene script, create character bibles, and issue camera instructions. Specialized generation backends produce both "story" and "performance" shots, while a Verifier agent enforces factual alignment and iteratively requests revisions to maintain temporal coherence, lip sync, and visual quality.

To evaluate this long-form task, we release AutoMV-Bench, a benchmark of 60 songs scored by expert raters across four high-level dimensions and twelve granular criteria. We further design LLM-Score, an LLM-based judge that correlates strongly with human ratings and enables scalable assessment. AutoMV substantially outperforms commercial baselines on every category, narrowing the gap to human-directed productions.

Key Contributions

End-to-End Multi-Agent Pipeline

AutoMV connects MIR parsing, screenwriting, directing, generation, and verification agents so that a full music video can be produced directly from audio and lyrics.

Music-Aware Planning

Our agents leverage beat, structure, and lyric cues to design camera moves, shot types, and character profiles that stay synchronized with the soundtrack.

Iterative Verifier

A Gemini-based Verifier agent automatically checks alignment, physical feasibility, and continuity, requesting reshoots until the script is satisfied.

AutoMV-Bench & LLM-Score

We release the first benchmark for music-to-video quality along with LLM-Score, an automatic judge that approximates expert evaluation across twelve criteria.

System Overview

AutoMV starts from raw audio, executes music information retrieval to obtain beat grids, structure, and lyric timestamps, and dispatches the data to collaborating agents. The Screenwriter agent produces a storyboard and dialogue-level prompts, while the Director agent assigns shot lists, character poses, and camera directives. Their outputs populate a persistent character bank and scene graph used to query multiple video backends—including diffusion-based generators for story segments and talking/singing avatar models for performances.
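The beat grid produced by the MIR stage is what keeps later editing decisions on the rhythm. As a minimal illustrative sketch (the helpers below are hypothetical, not AutoMV's actual implementation), a constant-tempo beat grid can be built from an estimated BPM, and proposed shot boundaries snapped to the nearest beat:

```python
# Hypothetical beat-alignment helpers; AutoMV's real MIR stack and
# editing logic are more involved (variable tempo, downbeats, sections).

def beat_grid(bpm: float, duration_s: float) -> list[float]:
    """Beat timestamps in seconds for a constant-tempo song."""
    period = 60.0 / bpm
    n = int(duration_s / period) + 1
    return [i * period for i in range(n)]

def snap_to_beat(t: float, grid: list[float]) -> float:
    """Move a proposed cut point to the closest beat."""
    return min(grid, key=lambda b: abs(b - t))

grid = beat_grid(bpm=120, duration_s=10)   # beats every 0.5 s
print(snap_to_beat(3.7, grid))             # -> 3.5
```

In the full system, cut points proposed by the Director agent would be snapped this way so that scene transitions land on (down)beats.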

Generated clips loop through a Verifier agent that critiques synchronization, continuity, and cinematic quality. Failed shots are sent back for targeted regeneration, and the surviving clips are automatically edited into a cohesive music video with transitions and subtitles aligned to the lyrics.

  • MIR Module: Beat tracking, vocal separation, section segmentation, and lyric alignment.
  • Creative Agents: Screenwriter & Director maintain shared context for characters, scenes, and camera plans.
  • Generation Hub: Story-image diffusion, singing avatar, and choreography models coordinated per shot type.
  • Verifier Loop: Gemini-based critic ensures audiovisual alignment and requests reshoots when necessary.
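The verifier loop above can be sketched as a bounded retry cycle in which critique feedback is fed back into the next generation attempt. The functions below are toy stand-ins, not AutoMV's actual generation backends or Gemini-based Verifier:

```python
# Sketch of a verifier-driven regeneration loop. generate_shot and verify
# are hypothetical stand-ins for the generation backends and the Verifier.
from dataclasses import dataclass

@dataclass
class Critique:
    passed: bool
    feedback: str = ""

def produce_shot(prompt: str, generate_shot, verify, max_rounds: int = 3):
    """Regenerate a shot until the verifier accepts it or rounds run out."""
    clip, critique = None, Critique(passed=False, feedback="initial")
    for _ in range(max_rounds):
        clip = generate_shot(prompt, feedback=critique.feedback)
        critique = verify(clip)
        if critique.passed:
            break
    return clip, critique

# Toy backends: generation "succeeds" once feedback mentions lip sync.
def fake_generate(prompt, feedback=""):
    return {"prompt": prompt, "fixed_lip_sync": "lip sync" in feedback}

def fake_verify(clip):
    if clip["fixed_lip_sync"]:
        return Critique(passed=True)
    return Critique(passed=False, feedback="lip sync drifts after chorus")

clip, critique = produce_shot("verse 1, close-up", fake_generate, fake_verify)
print(critique.passed)  # -> True
```

Bounding the number of rounds keeps cost predictable; shots that still fail after the budget is spent would be flagged rather than looped indefinitely.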
AutoMV system diagram showing multi-agent workflow

Production Efficiency

AutoMV collapses the traditional, labor-heavy music video pipeline into a compact agentic workflow. The chart contrasts a human crew—spanning scriptwriters, directors, actors, editors, and moderators—with AutoMV's MIR, VLM, and generation modules.

  • 120 hours of manual coordination shrink to 0.5 hour of automated processing.
  • Budgets drop from $10k to roughly $15 of compute.
  • Expert quality scores remain competitive: 2.4/5 fully automated versus 2.9/5 for human-directed baselines, with no manual touch-ups, enabling rapid iteration.
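The reported figures imply large reduction factors, which a back-of-the-envelope calculation makes concrete:

```python
# Rough time/cost-reduction factors implied by the figures above.
human_hours, auto_hours = 120, 0.5
human_cost, auto_cost = 10_000, 15

print(f"time reduction: {human_hours / auto_hours:.0f}x")  # -> 240x
print(f"cost reduction: {human_cost / auto_cost:.0f}x")    # -> 667x
```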

These savings let artists and studios prototype full music videos at a fraction of today's cost while keeping creative control over prompts and lyrical direction.

Comparison chart showing human production requiring 120 hours and $10k versus AutoMV taking 0.5 hour and $15 with competitive quality.

Quality Against Baselines

AutoMV outperforms commercial systems such as Revid.ai-base and OpenArt-story across every AutoMV-Bench dimension, approaching expert-directed productions. The table below summarizes cost, generation time, beat alignment (IB), and four category sub-metrics spanning Music Content (TE, PO, CO, AR) and Human Study scores.

Benchmark table comparing AutoMV with commercial baselines and human experts across cost, time, and evaluation metrics.

To ensure that automatic metrics mirror expert judgement, we correlate LLM-Score with human annotations. Gemini 3.5 Pro-Preview delivers the strongest alignment—reaching up to 0.74 on performance artistry—suggesting that our benchmark faithfully reflects human preferences across the twelve criteria.
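Validating LLM-Score against human annotators reduces to computing Pearson correlations per criterion. A self-contained sketch with made-up scores (the numbers below are illustrative, not AutoMV-Bench data):

```python
# Illustrative check of how an LLM judge could be validated against
# human ratings via Pearson correlation; all scores here are invented.
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

human = [2.1, 3.4, 2.8, 4.0, 1.9]   # hypothetical expert scores
llm   = [2.3, 3.1, 3.0, 3.8, 2.0]   # hypothetical LLM-judge outputs
print(f"r = {pearson(human, llm):.2f}")  # -> r = 0.98
```

Repeating this per criterion and per judge model yields the kind of correlation heatmap shown here.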

Heatmap of Pearson correlation coefficients between human raters and multimodal models across AutoMV-Bench metrics.

Quality Demos

AutoMV achieves the highest IB score (24.4%) while maintaining competitive TeG, PoG, CoG, and ArG ratings. Ablations confirm that lyrics grounding, the character bank, and the verifier agent each contribute measurable gains, pushing AutoMV to a 2.42 expert score—closing most of the gap to human-directed references.

Results at a Glance

  • Beat alignment: a +9.7-point improvement over Pika Video in the expert study.
  • Lyric faithfulness: 86% of AutoMV shots accurately depict lyrical content vs. 41% for baselines.
  • Continuity: Verifier-driven reshoots cut character drift errors by 62%.
  • User study: 78% of raters prefer AutoMV over commercial alternatives for storytelling depth.

Qualitatively, AutoMV maintains protagonists and wardrobes across long scenes, integrates choreography cues with percussive beats, and edits transitions using the learned structure chart. Please explore additional examples in the video carousel above and in the project repository.

AutoMV qualitative results montage

More Video Examples

BibTeX

@misc{tang2025automvautomaticmultiagentmusic,
      title={AutoMV: An Automatic Multi-Agent System for Music Video Generation}, 
      author={Xiaoxuan Tang and Xinping Lei and Chaoran Zhu and Shiyun Chen and Ruibin Yuan and Yizhi Li and Changjae Oh and Ge Zhang and Wenhao Huang and Emmanouil Benetos and Yang Liu and Jiaheng Liu and Yinghao Ma},
      year={2025},
      eprint={2512.12196},
      archivePrefix={arXiv},
      primaryClass={cs.MM},
      url={https://arxiv.org/abs/2512.12196}, 
}