MM-LDM: Multi-Modal Latent Diffusion Model for Sounding Video Generation

Abstract: Sounding Video Generation (SVG) is a challenging multi-modal generation task: it requires synthesizing realistic high-dimensional video and audio signals with limited computational resources, and bridging the gap between data representation and conveyed messages to achieve content consistency. In this paper, we introduce a novel multi-modal latent diffusion model (MM-LDM) to solve this task. We first unify the representation of audio and video signals by converting them into a single image or a couple of images. Then, we introduce a hierarchical multi-modal autoencoder that constructs a low-level perceptual latent space for each modality and a shared high-level semantic feature space. The perceptual latent space is perceptually equivalent to the raw signal space of its modality but drastically reduces computational complexity. The semantic feature space is derived from the perceptual spaces and provides more insightful cross-modal guidance to bridge the information gap between modalities. In addition, our method can be extended to synthesize one modality conditioned on the other through cross-modal sampling guidance. We obtain new state-of-the-art results with significant quality and efficiency gains. In particular, our method improves on all evaluation metrics and trains and samples faster on the Landscape and AIST++ datasets.
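
To make the idea of representing audio as an image concrete, the sketch below maps a mono waveform to a 2D log-magnitude spectrogram scaled to [0, 1]. This page does not say which transform MM-LDM uses, so the spectrogram choice, the function name `audio_to_image`, and the parameters `n_fft` and `hop` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def audio_to_image(wave, n_fft=512, hop=128):
    """Illustrative only: map a 1D waveform to a (freq_bins, time_frames) image.

    Hypothetical helper, not taken from MM-LDM. It assumes a plain
    short-time Fourier transform with a Hann window.
    """
    wave = np.asarray(wave, dtype=np.float64)
    if len(wave) < n_fft:
        raise ValueError("waveform is shorter than one analysis window")

    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop

    # Slice the waveform into overlapping, windowed frames.
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )

    # Magnitude spectrum per frame: shape (n_frames, n_fft // 2 + 1).
    magnitude = np.abs(np.fft.rfft(frames, axis=1))

    # Compress the dynamic range, then rescale to [0, 1] like pixel intensities.
    log_mag = np.log1p(magnitude)
    span = log_mag.max() - log_mag.min()
    image = (log_mag - log_mag.min()) / (span if span > 0 else 1.0)

    # Put frequency on the vertical axis and time on the horizontal axis.
    return image.T


# Usage: one second of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone_image = audio_to_image(np.sin(2 * np.pi * 440 * t))
```

Once audio is in this 2D form, it can be handled by the same kind of image encoder as video frames, which is the practical appeal of a unified image-like representation.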

Videos on this page are encoded in H.264 and can be played in Google Chrome.

Longer Sounding Videos (1 minute, 608 frames)

Zero-shot Conditional Generation

Video-to-audio Generation on AIST++

Audio-to-video Generation on Landscape

Sounding Video Generation on AIST++ 256x256

MM-Diffusion

MM-Diffusion + SR

MM-LDM

Sounding Video Generation on Landscape 256x256

MM-Diffusion

MM-Diffusion + SR

MM-LDM

Conditional Generation

Video-to-audio Generation on AIST++

Audio-to-video Generation on Landscape

Long Sounding Video Generation

Video Continuation

Audio Continuation