kandinsky 4 v2a

ai-forever

Introduction

KANDINSKY-4-V2A is a video-to-audio pipeline that generates an audio track for an input video. It combines a visual encoder, a text encoder, a UNet diffusion model that generates spectrograms, and the Griffin-Lim algorithm, which converts those spectrograms into audio. A single multimodal visual language decoder handles both the visual and the text encoding.

Architecture

The pipeline consists of the following components:

  • Visual Encoder: Processes video frames.
  • Text Encoder: Encodes text prompts.
  • UNet Diffusion Model: Finetuned from the riffusion music generation model and adapted to condition on video frames, which improves synchronization between video and audio.
  • Griffin-Lim Algorithm: Converts spectrograms into audio.
  • Shared Multimodal Visual Language Decoder: Utilizes the cogvlm2-video-llama3-chat model for both visual and text encoding.
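
To make the Griffin-Lim step more concrete, here is a minimal NumPy sketch of the algorithm: it starts from a magnitude spectrogram with random phase and alternates between inverting to a waveform and re-analysing it, keeping the target magnitude each time. This is an illustration only, not the pipeline's own code; the names N_FFT, HOP, stft, istft, griffin_lim and spectral_error, and the window and hop sizes, are assumptions chosen for the example.

```python
import numpy as np

N_FFT, HOP = 512, 128                      # illustrative sizes, not the pipeline's
WINDOW = np.hanning(N_FFT + 1)[:-1]        # periodic Hann window


def stft(signal):
    """Complex STFT with shape (frames, N_FFT // 2 + 1)."""
    n_frames = 1 + (len(signal) - N_FFT) // HOP
    frames = np.stack(
        [signal[i * HOP:i * HOP + N_FFT] * WINDOW for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def istft(spec, length):
    """Least-squares (weighted overlap-add) inverse of stft()."""
    frames = np.fft.irfft(spec, n=N_FFT, axis=1) * WINDOW
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * HOP:i * HOP + N_FFT] += frame
        norm[i * HOP:i * HOP + N_FFT] += WINDOW ** 2
    return out / np.maximum(norm, 1e-8)


def griffin_lim(magnitude, n_iter=30, seed=0):
    """Estimate a waveform whose STFT magnitude approximates `magnitude`."""
    length = N_FFT + HOP * (magnitude.shape[0] - 1)
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))  # random start
    for _ in range(n_iter):
        signal = istft(magnitude * phase, length)
        rebuilt = stft(signal)
        # Keep only the phase of the re-analysed signal; reuse the target magnitude.
        phase = rebuilt / np.maximum(np.abs(rebuilt), 1e-8)
    return istft(magnitude * phase, length)


def spectral_error(magnitude, signal):
    """Distance between the target magnitude and the signal's STFT magnitude."""
    return float(np.linalg.norm(np.abs(stft(signal)) - magnitude))


# Usage: recover a 440 Hz tone from its magnitude spectrogram alone.
sample_rate = 8000
t = np.arange(N_FFT + HOP * 40) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)
target = np.abs(stft(tone))
audio = griffin_lim(target, n_iter=30)
print("error with random phase:", spectral_error(target, griffin_lim(target, n_iter=0)))
print("error after 30 iterations:", spectral_error(target, audio))
```

Because the phase is estimated rather than generated, the result depends on the number of iterations; more iterations usually lower the spectral error at the cost of extra compute.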

Training

The UNet diffusion model is finetuned to condition on video frames. Architectural modifications include replacing the typical text encoder with the decoder from cogvlm2-video-llama3-chat to enhance synchronization between video and audio.

Guide: Running Locally

  1. Clone the Repository:
    git clone https://github.com/ai-forever/Kandinsky-4.git
    cd Kandinsky-4
    
  2. Install Dependencies:
    conda install -c conda-forge ffmpeg -y
    pip install -r kandinsky4_video2audio/requirements.txt
    pip install "git+https://github.com/facebookresearch/pytorchvideo.git"
    
  3. Run Inference:
    import torch
    import torchvision
    from kandinsky4_video2audio.video2audio_pipe import Video2AudioPipeline
    from kandinsky4_video2audio.utils import load_video, create_video

    device = 'cuda:0'

    # Load the pipeline in half precision on the selected GPU.
    pipe = Video2AudioPipeline(
        "ai-forever/kandinsky-4-v2a",
        torch_dtype=torch.float16,
        device=device
    )

    # read_video returns (video frames, audio frames, metadata dict).
    video_path = 'assets/inputs/1.mp4'
    video, _, info = torchvision.io.read_video(video_path)

    prompt = "clean. clear. good quality."
    negative_prompt = "hissing noise. drumming rhythm. saying. poor quality."

    # Prepare the model input: 96 frames, with the duration capped at 12 seconds.
    video_input, video_complete, duration_sec = load_video(
        video, info['video_fps'], num_frames=96, max_duration_sec=12
    )

    # Generate the audio track for the video.
    out = pipe(
        video_input,
        prompt,
        negative_prompt=negative_prompt,
        duration_sec=duration_sec,
    )[0]

    # Combine the generated audio with the video and save the result.
    save_path = 'assets/outputs/1.mp4'
    create_video(
        out,
        video_complete,
        display_video=True,
        save_path=save_path,
        device=device
    )
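
The load_video call above takes num_frames=96 and max_duration_sec=12. As a rough illustration of what such parameters typically mean, the sketch below caps a clip at a maximum duration and picks evenly spaced frame indices from what remains. The helper sample_frame_indices is hypothetical and is not the repository's load_video implementation, which may select frames differently.

```python
def sample_frame_indices(total_frames, fps, num_frames=96, max_duration_sec=12):
    """Return (indices, duration_sec) for evenly spaced frames in the kept clip.

    Hypothetical helper for illustration; not part of kandinsky4_video2audio.
    """
    if total_frames <= 0 or fps <= 0:
        raise ValueError("total_frames and fps must be positive")
    # Keep at most max_duration_sec seconds of video.
    usable_frames = min(total_frames, int(round(fps * max_duration_sec)))
    duration_sec = usable_frames / fps
    if num_frames == 1:
        return [0], duration_sec
    # Spread num_frames indices evenly from the first to the last usable frame.
    step = (usable_frames - 1) / (num_frames - 1)
    indices = [int(round(i * step)) for i in range(num_frames)]
    return indices, duration_sec


# A 30-second clip at 30 fps is cut to 12 seconds (360 frames) before sampling.
indices, duration = sample_frame_indices(total_frames=900, fps=30.0)
print(len(indices), indices[0], indices[-1], duration)  # 96 0 359 12.0
```

For clips with fewer usable frames than num_frames, this sketch repeats some indices instead of failing, so the model input keeps a fixed length.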
    

The example above targets a CUDA device (cuda:0). If no suitable local GPU is available, consider a cloud GPU from a provider such as AWS, Google Cloud, or Azure.

License

This project is licensed under the Apache-2.0 License.
