Cosmos 0.1 Tokenizer CV8x8x8

Introduction

Cosmos Tokenizer is a suite of visual tokenizers for images and videos developed by NVIDIA, designed to offer a range of compression rates while maintaining high reconstruction quality. It serves as a building block in diffusion-based and autoregressive models for image and video generation, and is available in both continuous and discrete variants.

Architecture

Cosmos Tokenizer employs a lightweight, efficient architecture with a temporally causal design: causal temporal convolution and causal temporal attention layers preserve the natural order of video frames. The symmetrical encoder-decoder uses a 2-level Haar wavelet transform for down-sampling and its inverse for up-sampling. Continuous tokenizers model the latent space with an autoencoder, while discrete tokenizers quantize it with Finite Scalar Quantization (FSQ).
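To make the temporally causal design concrete, the minimal sketch below (not the actual Cosmos implementation; CausalConv3d is a hypothetical helper written for illustration) pads a 3D convolution only on the past side of the time axis, so each output frame depends only on the current and earlier frames.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CausalConv3d(nn.Module):
        """Illustrative sketch of a temporally causal 3D convolution."""

        def __init__(self, in_ch, out_ch, kernel=(3, 3, 3)):
            super().__init__()
            self.t_pad = kernel[0] - 1                     # pad time only on the left (past)
            self.h_pad, self.w_pad = kernel[1] // 2, kernel[2] // 2
            self.conv = nn.Conv3d(in_ch, out_ch, kernel)

        def forward(self, x):                              # x: [B, C, T, H, W]
            # F.pad pads the last dims first: (W_left, W_right, H_left, H_right, T_left, T_right)
            x = F.pad(x, (self.w_pad, self.w_pad, self.h_pad, self.h_pad, self.t_pad, 0))
            return self.conv(x)

    # Output frame t only sees input frames <= t, preserving temporal order.
    y = CausalConv3d(3, 8)(torch.randn(1, 3, 9, 64, 64))
    print(y.shape)  # torch.Size([1, 8, 9, 64, 64])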

Training

The model family supports spatial compression rates of 8x8 or 16x16 and temporal compression factors of 4x or 8x, for an overall compression of up to 2048x. It is optimized for NVIDIA Ampere and Hopper GPUs, uses BF16 precision, and runs on Linux. Evaluated on standard datasets and a custom benchmark, TokenBench, it outperforms existing state-of-the-art methods.
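For orientation, the overall compression factor is the product of the spatial and temporal factors, so the 2048x maximum corresponds to 16x16 spatial with 8x temporal compression. The short sketch below also computes the latent layout one would expect for the CV8x8x8 variant, assuming the causal tokenizer maps T input frames to 1 + (T - 1) / temporal_factor latent frames (this shape formula is an assumption; verify against the repository).

    # Compression factor = spatial_factor^2 * temporal_factor.
    spatial, temporal = 16, 8
    print(spatial * spatial * temporal)  # 2048, the maximum supported compression

    # Assumed latent layout for Cosmos-Tokenizer-CV8x8x8 (8x temporal, 8x8 spatial):
    def latent_shape(T, H, W, spatial_factor=8, temporal_factor=8):
        assert (T - 1) % temporal_factor == 0 and H % spatial_factor == 0 and W % spatial_factor == 0
        return (1 + (T - 1) // temporal_factor, H // spatial_factor, W // spatial_factor)

    print(latent_shape(17, 512, 512))  # (3, 64, 64): latent frames x height x width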

Guide: Running Locally

Step 1: Installation of Cosmos-Tokenizer

  1. Clone the Cosmos-Tokenizer repository.
    git clone https://github.com/NVIDIA/Cosmos-Tokenizer.git
    cd Cosmos-Tokenizer
    
  2. Install dependencies.
    pip3 install -r requirements.txt
    apt-get install -y ffmpeg
    
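Optionally, a quick sanity check (illustrative, not part of the official guide) confirms that PyTorch can see a CUDA device and that ffmpeg is on the PATH before running the BF16 GPU example in Step 3.

    import shutil
    import torch

    # The inference example in Step 3 runs in bfloat16 on a CUDA device.
    assert torch.cuda.is_available(), "A CUDA-capable GPU is required."
    assert shutil.which("ffmpeg") is not None, "ffmpeg was not found on PATH."
    print(torch.cuda.get_device_name(0))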

Step 2: Download Pre-Trained Checkpoints

  1. Use Hugging Face to download the pre-trained checkpoints.
    from huggingface_hub import login, snapshot_download
    import os
    
    login(token="<YOUR-HF-TOKEN>", add_to_git_credential=True)  # paste your Hugging Face access token
    model_names = ["Cosmos-Tokenizer-CI8x8", "Cosmos-Tokenizer-CI16x16", "Cosmos-Tokenizer-CV4x8x8"]  # ... add any other variants you need
    for model_name in model_names:
        hf_repo = "nvidia/" + model_name
        local_dir = "pretrained_ckpts/" + model_name
        os.makedirs(local_dir, exist_ok=True)
        snapshot_download(repo_id=hf_repo, local_dir=local_dir)
    
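As an optional follow-up (illustrative only), you can verify that each downloaded checkpoint directory contains the encoder.jit and decoder.jit files that the inference step below expects.

    import os

    for model_name in model_names:
        local_dir = "pretrained_ckpts/" + model_name
        for fname in ("encoder.jit", "decoder.jit"):
            path = os.path.join(local_dir, fname)
            print(path, "OK" if os.path.exists(path) else "MISSING")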

Step 3: Run Inference

  1. Encode and decode images or videos using the model.
    import torch
    from cosmos_tokenizer.video_lib import CausalVideoTokenizer
    
    model_name = "Cosmos-Tokenizer-CV4x8x8"
    input_tensor = torch.randn(1, 3, 9, 512, 512).to('cuda').to(torch.bfloat16)  # [B, C, T, H, W]
    encoder = CausalVideoTokenizer(checkpoint_enc=f'pretrained_ckpts/{model_name}/encoder.jit')
    (latent,) = encoder.encode(input_tensor)  # continuous tokenizers return a one-element tuple
    decoder = CausalVideoTokenizer(checkpoint_dec=f'pretrained_ckpts/{model_name}/decoder.jit')
    reconstructed_tensor = decoder.decode(latent)
    
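Continuing from the snippet above, a simple reconstruction metric (illustrative only) compares the decoder output against the input; with random noise the numbers are not meaningful, but the same code applies to real video tensors. The PSNR line assumes inputs normalized to the [-1, 1] range.

    # Cast to float32 before computing metrics.
    x = input_tensor.float()
    x_hat = reconstructed_tensor.float()
    mse = torch.mean((x - x_hat) ** 2)
    psnr = 10 * torch.log10(4.0 / mse)  # peak-to-peak of 2.0 assumed ([-1, 1] inputs)
    print(f"MSE: {mse.item():.4f}, PSNR: {psnr.item():.2f} dB")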

Cloud GPUs: For enhanced performance, consider using NVIDIA A100 or H100 GPUs available on cloud platforms.

License

The Cosmos Tokenizer is released under the NVIDIA Open Model License, which allows commercial use and the distribution of derivative models. NVIDIA does not claim ownership of any outputs generated using these models. The full license text is available from NVIDIA.
