Kandinsky-4 V2A
Introduction
Kandinsky-4 V2A is a video-to-audio pipeline that generates a soundtrack for an input video, guided by a text prompt. It combines a visual encoder, a text encoder, a UNet diffusion model that produces an audio spectrogram, and the Griffin-Lim algorithm, which converts that spectrogram into a waveform. Visual and text encoding share a single multimodal visual language decoder.
Architecture
The pipeline consists of the following components:
- Visual Encoder: Processes video frames.
- Text Encoder: Encodes text prompts.
- UNet Diffusion Model: Fine-tuned from the riffusion music-generation model and conditioned on video frames to improve synchronization between video and audio.
- Griffin-Lim Algorithm: Converts the generated spectrogram into an audio waveform (see the sketch after this list).
- Shared Multimodal Visual Language Decoder: Uses the cogvlm2-video-llama3-chat model for both visual and text encoding.
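
The final vocoding stage is standard enough to illustrate in isolation. Below is a minimal sketch of spectrogram inversion using torchaudio's GriffinLim transform; the FFT size, iteration count, and sample rate are illustrative assumptions, not the values Kandinsky-4 V2A actually uses.

```python
import torch
import torchaudio

# Illustrative STFT parameters and sample rate (assumed, not the
# pipeline's actual settings).
n_fft = 1024
sample_rate = 16000

# Stand-in for a magnitude spectrogram produced by the diffusion model:
# shape (freq_bins, time_frames), where freq_bins = n_fft // 2 + 1.
spectrogram = torch.rand(n_fft // 2 + 1, 400)

# Griffin-Lim iteratively estimates the phase and inverts the STFT.
griffin_lim = torchaudio.transforms.GriffinLim(n_fft=n_fft, n_iter=32)
waveform = griffin_lim(spectrogram)  # shape: (num_samples,)

# torchaudio.save expects (channels, samples).
torchaudio.save("reconstructed.wav", waveform.unsqueeze(0), sample_rate)
```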
Training
The UNet diffusion model is fine-tuned to condition on video frames. The main architectural modification is replacing the usual text encoder with the decoder from cogvlm2-video-llama3-chat, which improves synchronization between the video and the generated audio.
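
To make this conditioning scheme concrete, here is a minimal sketch in the style of diffusers' UNet2DConditionModel, where per-frame video embeddings are passed as the cross-attention context in place of text embeddings. All dimensions, and the use of diffusers itself, are assumptions for illustration; the actual model is the fine-tuned riffusion UNet.

```python
import torch
from diffusers import UNet2DConditionModel

# Illustrative sketch (assumed dimensions, not the released checkpoint):
# video-frame embeddings replace text embeddings as cross-attention context.
unet = UNet2DConditionModel(
    sample_size=64,            # latent spectrogram resolution (assumed)
    in_channels=4,
    out_channels=4,
    cross_attention_dim=1024,  # must match the decoder's hidden size (assumed)
)

batch, num_frames, hidden = 1, 96, 1024
# Stand-in for per-frame embeddings from the cogvlm2-video-llama3-chat decoder.
frame_embeddings = torch.randn(batch, num_frames, hidden)

noisy_latents = torch.randn(batch, 4, 64, 64)
timestep = torch.tensor([10])

# The UNet attends to the frame embeddings exactly where it would
# normally attend to text embeddings.
noise_pred = unet(noisy_latents, timestep,
                  encoder_hidden_states=frame_embeddings).sample
```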
Guide: Running Locally
- Clone the Repository:
```bash
git clone https://github.com/ai-forever/Kandinsky-4.git
cd Kandinsky-4
```
- Install Dependencies:
```bash
conda install -c conda-forge ffmpeg -y
pip install -r kandinsky4_video2audio/requirements.txt
pip install "git+https://github.com/facebookresearch/pytorchvideo.git"
```
- Run Inference:
```python
import torch
import torchvision

from kandinsky4_video2audio.video2audio_pipe import Video2AudioPipeline
from kandinsky4_video2audio.utils import load_video, create_video

device = 'cuda:0'

# Load the pipeline weights from the Hugging Face Hub in half precision.
pipe = Video2AudioPipeline(
    "ai-forever/kandinsky-4-v2a",
    torch_dtype=torch.float16,
    device=device
)

# Read the input video and its frame rate.
video_path = 'assets/inputs/1.mp4'
video, _, fps = torchvision.io.read_video(video_path)

prompt = "clean. clear. good quality."
negative_prompt = "hissing noise. drumming rhythm. saying. poor quality."

# Sample up to 96 frames (at most 12 seconds) for conditioning.
video_input, video_complete, duration_sec = load_video(
    video, fps['video_fps'], num_frames=96, max_duration_sec=12
)

# Generate the audio track.
out = pipe(
    video_input,
    prompt,
    negative_prompt=negative_prompt,
    duration_sec=duration_sec,
)[0]

# Mux the generated audio back onto the full video and save it.
save_path = 'assets/outputs/1.mp4'
create_video(
    out,
    video_complete,
    display_video=True,
    save_path=save_path,
    device=device
)
```
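
If you only need the audio track, and assuming `out` is a waveform tensor (neither its layout nor the pipeline's sample rate is documented here, so both are assumptions), torchaudio can write it to disk directly:

```python
import torchaudio

# Assumptions, not confirmed by this card: `out` is a waveform tensor
# and the sample rate is 16 kHz. Adjust both if they differ.
waveform = out.detach().float().cpu()
if waveform.dim() == 1:  # torchaudio.save expects (channels, samples)
    waveform = waveform.unsqueeze(0)
torchaudio.save('assets/outputs/1.wav', waveform, sample_rate=16000)
```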
For optimal performance, consider using cloud GPUs such as those offered by AWS, Google Cloud, or Azure.
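
If no GPU is available, a simple fallback (an untested assumption that the pipeline also accepts a CPU device with float32, albeit much more slowly) is:

```python
import torch

# Prefer fp16 on CUDA as in the guide above; fall back to fp32 on CPU.
if torch.cuda.is_available():
    device, dtype = 'cuda:0', torch.float16
else:
    device, dtype = 'cpu', torch.float32
```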
License
This project is licensed under the Apache-2.0 License.