Sony AI·🎵 Music Generation

MMAudio V2

Quick reference

MMAudio V2 — TLDR

🎵 Generates synchronized audio and sound effects from text prompts.
👁️ Also supports video-to-audio, aligning sound to on-screen motion.
🧠 Built on multimodal joint training across audio-text and audio-visual data.
🔧 Offers negative prompts and duration control for finer output.
🎯 Default large_44k_v2 weights produce 44 kHz audio output.
🏢 Research from University of Illinois and Sony AI, presented at CVPR 2025.
📚 Trained on datasets including AudioSet, VGGSound, AudioCaps and WavCaps.

📅 On Venice since

Feb 28, 2026

Provider

Sony AI

Sony AI is the artificial intelligence research organization operating under the Sony group, applying machine learning research to areas spanning audio, imaging, and creative media. Its work reflects Sony's broader heritage in sound and entertainment…

1 model on Venice

1 music

About this model

MMAudio V2 is a multimodal audio generation model that synthesizes sound from text prompts and, optionally, from video, producing tracks that align with both semantic descriptions and visual timing. Within Venice's catalog it is exposed as a text-to-audio tool: you describe an environment, material, or effect, and the model returns matching audio. The same underlying system can also add synchronized soundscapes to silent footage when given a video input.

The model traces to the CVPR 2025 research paper "MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis," whose key innovation is joint training across large-scale audio-text and audio-visual datasets to improve semantic alignment and audio-visual synchrony. It draws on corpora including AudioSet, Freesound, VGGSound, AudioCaps and WavCaps.

The V2 release centers on the large_44k_v2 weights, the default in the official repository, which produce 44 kHz audio. Venice's catalog lists no earlier MMAudio entry, so V1-versus-V2 comparisons here rest on these published characteristics rather than on head-to-head benchmark results.

Practical controls include an optional negative prompt to exclude unwanted audio characteristics and duration control for varying clip lengths, with video workflows also offering a mask-away-clip option. These features make it suited to video post-production, sound design, and game-audio prototyping.
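
As a rough illustration of how these controls fit together, the sketch below bundles a prompt, an optional negative prompt, and a duration into one request object. The class name, field names, and the 8-second default are hypothetical and are not taken from Venice's API or the MMAudio repository. The 44,100 Hz figure is an assumed nominal rate behind the "44 kHz" label.

```python
from dataclasses import dataclass, asdict

# Assumption: "44 kHz" refers to the common 44,100 Hz sample rate.
SAMPLE_RATE_HZ = 44_100


@dataclass(frozen=True)
class AudioRequest:
    """Hypothetical container for the text-to-audio controls described above."""

    prompt: str
    negative_prompt: str = ""   # audio characteristics to steer away from
    duration_s: float = 8.0     # illustrative default, not a documented value

    def __post_init__(self) -> None:
        # Reject inputs that could not describe a meaningful clip.
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")
        if self.duration_s <= 0:
            raise ValueError("duration_s must be positive")

    def expected_samples(self) -> int:
        """Samples per channel for the requested duration at the assumed rate."""
        return round(self.duration_s * SAMPLE_RATE_HZ)

    def to_payload(self) -> dict:
        """Plain dict form; the negative prompt is omitted when unused."""
        payload = asdict(self)
        if not payload["negative_prompt"]:
            del payload["negative_prompt"]
        return payload


request = AudioRequest(
    prompt="rain on a tin roof",
    negative_prompt="music, speech",
    duration_s=2.0,
)
print(request.to_payload())
print(request.expected_samples())  # 2.0 s * 44,100 Hz = 88200
```

Treating the negative prompt as optional mirrors the description above: it narrows the output only when supplied, while duration always determines the clip length.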

Sources

MMAudio — generating synchronized audio from video/text, a Hugging Face Space by hkchengrex (huggingface.co)

This About section is AI-generated from public sources (Claude Opus 4.8), with no human editing. It may contain inaccuracies — verify critical details against the sources listed above.

Data sources: Venice API · HuggingFace · Wikipedia