Cosmos-0.1-Tokenizer-DV4x8x8


Introduction

Cosmos Tokenizer is a suite of visual tokenizers developed by NVIDIA for images and videos. It offers various compression rates while maintaining high reconstruction quality, making it suitable for both diffusion-based and autoregressive models used in image and video generation.

Architecture

Cosmos Tokenizer employs a lightweight, computationally efficient architecture. It includes causal temporal convolution and attention layers to maintain the temporal order of video frames. The encoder and decoder are symmetrical, using a 2-level Haar wavelet transform for down-sampling and inverse transformation for reconstruction. Continuous tokenizers utilize a vanilla autoencoder, while discrete tokenizers use Finite-Scalar-Quantization for latent space quantization.
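To make the discrete branch concrete, here is a minimal pure-Python sketch of Finite-Scalar Quantization (FSQ): each latent channel is bounded, scaled to a fixed number of levels, and rounded, so every quantized vector maps to a single integer token index. The function names and the level configuration are illustrative, not the repository's implementation.

```python
import math

def fsq_quantize(z, levels):
    """Quantize one latent vector z (len(z) == len(levels)) onto a finite grid.

    Each channel i is bounded with tanh, scaled to levels[i] discrete values,
    and rounded, giving a vector whose entries lie in [-1, 1].
    """
    q = []
    for x, num_levels in zip(z, levels):
        half = (num_levels - 1) / 2
        bounded = math.tanh(x) * half        # bound to (-half, half)
        q.append(round(bounded) / half)      # snap to one of num_levels values
    return q

def fsq_index(q, levels):
    """Map a quantized vector to a single integer code (mixed-radix encoding)."""
    idx = 0
    for v, num_levels in zip(q, levels):
        half = (num_levels - 1) / 2
        digit = int(round(v * half + half))  # digit in 0 .. num_levels - 1
        idx = idx * num_levels + digit
    return idx
```

Because the grid is fixed and has no learned codebook, FSQ avoids the codebook-collapse issues of classic vector quantization; the vocabulary size is simply the product of the per-channel level counts.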

Tokenizer Suite

The Cosmos Tokenizer suite includes continuous (C) and discrete (D) tokenizers for both images (I) and videos (V), each available at several spatial and temporal compression factors; for example, DV4x8x8 compresses a video 4x in time and 8x along each spatial dimension. Compared with state-of-the-art tokenizers, these models achieve significantly higher compression rates while delivering higher reconstruction quality and faster processing.
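The compression factors translate directly into the size of the token grid. The sketch below computes the latent shape for a causal video tokenizer under the assumption, consistent with the inference example later in this card, that the first frame is encoded on its own and the remaining frames are grouped by the temporal factor; the helper name is illustrative.

```python
def latent_shape(frames, height, width, t_factor, s_factor):
    """Token-grid shape for a causal video tokenizer.

    Assumes the first frame is kept and the remaining frames are grouped in
    chunks of t_factor, so the frame count must be 1 + k * t_factor.
    """
    assert (frames - 1) % t_factor == 0, "frames must be 1 + k * t_factor"
    return (1 + (frames - 1) // t_factor,
            height // s_factor,
            width // s_factor)
```

For instance, a 9-frame 512x512 clip fed to a 4x8x8 tokenizer yields a 3 x 64 x 64 token grid.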

Guide: Running Locally

Step 1: Installation

  1. Clone the Cosmos-Tokenizer repository:

    git clone https://github.com/NVIDIA/Cosmos-Tokenizer.git
    cd Cosmos-Tokenizer
    
  2. Install dependencies:

    pip3 install -r requirements.txt
    apt-get install -y ffmpeg
    
  3. Build a Docker image (optional):

    docker build -t cosmos-docker -f Dockerfile .
    docker run --gpus all -it --rm -v /home/${USER}:/home/${USER} --workdir ${PWD} cosmos-docker /bin/bash
    

Step 2: Download Pre-Trained Checkpoints

  1. Use Hugging Face to download pre-trained model checkpoints:
    from huggingface_hub import login, snapshot_download
    import os
    
    login(token="<YOUR-HF-TOKEN>", add_to_git_credential=True)
    model_names = [
        "Cosmos-Tokenizer-CI8x8", "Cosmos-Tokenizer-CI16x16", "Cosmos-Tokenizer-CV4x8x8",
        "Cosmos-Tokenizer-CV8x8x8", "Cosmos-Tokenizer-CV8x16x16", "Cosmos-Tokenizer-DI8x8",
        "Cosmos-Tokenizer-DI16x16", "Cosmos-Tokenizer-DV4x8x8", "Cosmos-Tokenizer-DV8x8x8",
        "Cosmos-Tokenizer-DV8x16x16",
    ]
    for model_name in model_names:
        hf_repo = "nvidia/" + model_name
        local_dir = "pretrained_ckpts/" + model_name
        os.makedirs(local_dir, exist_ok=True)
        snapshot_download(repo_id=hf_repo, local_dir=local_dir)
    

Step 3: Run Inference

  1. Encode and decode images or videos:
    import torch
    from cosmos_tokenizer.video_lib import CausalVideoTokenizer
    
    model_name = "Cosmos-Tokenizer-DV4x8x8"
    # Random video batch: (batch, channels, frames, height, width).
    # For 4x temporal compression, the frame count must be 1 + k * 4 (here 9).
    input_tensor = torch.randn(1, 3, 9, 512, 512).to('cuda').to(torch.bfloat16)
    encoder = CausalVideoTokenizer(checkpoint_enc=f'pretrained_ckpts/{model_name}/encoder.jit')
    # Discrete tokenizers return integer token indices and the pre-quantization codes.
    (indices, codes) = encoder.encode(input_tensor)
    
    decoder = CausalVideoTokenizer(checkpoint_dec=f'pretrained_ckpts/{model_name}/decoder.jit')
    # Decoding the indices reconstructs a tensor with the input's shape.
    reconstructed_tensor = decoder.decode(indices)
    
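After decoding, reconstruction quality is commonly reported as peak signal-to-noise ratio (PSNR) between the input and the reconstructed frames. Below is a minimal pure-Python sketch over flattened pixel values; it is a generic metric, not the repository's evaluation code, and the function name is illustrative.

```python
import math

def psnr(x, y, max_val=1.0):
    """Peak signal-to-noise ratio between two equal-length pixel sequences.

    max_val is the maximum possible pixel value (1.0 for [0, 1]-normalized
    frames). Higher PSNR means a closer reconstruction.
    """
    mse = sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)
    if mse == 0:
        return float("inf")  # identical inputs
    return 10 * math.log10(max_val ** 2 / mse)
```

In practice one would flatten the input and reconstructed tensors to lists (or use a tensor-native PSNR) and compare across the tokenizer variants' compression factors.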

For optimal performance, it's recommended to use NVIDIA's Ampere or Hopper GPUs, such as the A100 or H100.

License

The Cosmos Tokenizer is released under the NVIDIA Open Model License. This license allows for commercial use, creation and distribution of derivative models, and does not claim ownership of outputs generated using the models. More details can be found in the NVIDIA Open Model License Agreement.
