Cosmos 1.0 Tokenizer DV8x16x16


Introduction

The Cosmos Tokenizer is a suite of visual tokenizers designed by NVIDIA for compressing images and videos. It delivers high reconstruction quality across a range of compression rates and suits both diffusion-based and autoregressive models. This checkpoint, DV8x16x16, is a discrete video tokenizer with a temporal compression factor of 8 and a spatial compression factor of 16x16. The models are released for commercial use.

Architecture

The Cosmos Tokenizer features a lightweight, temporally causal architecture built from causal temporal convolution and causal temporal attention layers. The encoder and decoder form a symmetric pair, with the encoder first down-sampling inputs using a 2-level Haar wavelet transform. Continuous tokenizers model the latent space with an autoencoder formulation, while discrete tokenizers such as this one quantize the latents with Finite-Scalar-Quantization (FSQ).
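
As a minimal sketch of the FSQ idea (not the actual Cosmos implementation; the function name and the per-dimension `levels` are illustrative, with (8, 8, 8, 5, 5, 5) chosen to match the 64K vocabulary reported for this tokenizer family): each latent dimension is bounded, rounded to a small fixed set of levels, and trained with a straight-through estimator, so no learned codebook is required.

    import torch

    def fsq_quantize(z: torch.Tensor, levels=(8, 8, 8, 5, 5, 5)) -> torch.Tensor:
        # z: (..., d) latent with d == len(levels).
        L = torch.tensor(levels, dtype=z.dtype, device=z.device)
        half = (L - 1) / 2
        bounded = torch.tanh(z) * half    # squash dimension i into [-half_i, half_i]
        quantized = torch.round(bounded)  # snap to the nearest integer level
        # Straight-through estimator: the forward pass uses the rounded values,
        # while gradients flow through the bounded pre-quantization values.
        return bounded + (quantized - bounded).detach()

    # Each quantized vector takes one of 8*8*8*5*5*5 = 64,000 discrete values.
    codes = fsq_quantize(torch.randn(2, 6))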

Training

Cosmos Tokenizer ships with pre-trained models for both continuous and discrete tokenization, each available at several compression rates. The models compress visual data efficiently, reaching total compression factors of up to 2048x while outperforming prior state-of-the-art tokenizers in both speed and reconstruction quality.
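
As a back-of-the-envelope check (a sketch: the latent-frame formula assumes the temporally causal design encodes the first frame on its own, so T input frames map to 1 + (T - 1) / 8 latent frames):

    # DV8x16x16: temporal 8x and spatial 16x16 compression.
    temporal, spatial = 8, 16
    total_compression = temporal * spatial * spatial    # 8 * 16 * 16 = 2048

    # A 9-frame 512x512 clip, as used in the inference example below:
    frames, height, width = 9, 512, 512
    latent_frames = 1 + (frames - 1) // temporal        # 2 latent frames
    num_tokens = latent_frames * (height // spatial) * (width // spatial)
    print(total_compression, num_tokens)                # 2048 2048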

Guide: Running Locally

  1. Installation: Clone the Cosmos-Tokenizer repository from GitHub and install the necessary dependencies.

    git clone https://github.com/NVIDIA/Cosmos-Tokenizer.git
    cd Cosmos-Tokenizer
    pip3 install -r requirements.txt
    apt-get install -y ffmpeg
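
    # Hypothetical quick check that the package and its dependencies resolve
    # (run from the repository root):
    python3 -c "from cosmos_tokenizer.video_lib import CausalVideoTokenizer"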
    

    Optionally, build and run a Docker container:

    docker build -t cosmos-docker -f Dockerfile .
    docker run --gpus all -it --rm -v /home/${USER}:/home/${USER} --workdir ${PWD} cosmos-docker /bin/bash
    
  2. Pre-trained Checkpoints: Download the pre-trained models from Hugging Face.

    from huggingface_hub import login, snapshot_download
    import os

    # Authenticate with a Hugging Face access token (string placeholder).
    login(token="<YOUR-HF-TOKEN>", add_to_git_credential=True)

    # Download the tokenizer checkpoints into pretrained_ckpts/<model-name>.
    model_names = ["Cosmos-1.0-Tokenizer-DV8x16x16"]
    for model_name in model_names:
        hf_repo = "nvidia/" + model_name
        local_dir = "pretrained_ckpts/" + model_name
        os.makedirs(local_dir, exist_ok=True)
        snapshot_download(repo_id=hf_repo, local_dir=local_dir)
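    # Each local_dir should now contain the JIT-compiled encoder.jit and
    # decoder.jit that the inference step below loads.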
    
  3. Inference: Run the tokenizer for encoding and decoding.

    import torch
    from cosmos_tokenizer.video_lib import CausalVideoTokenizer

    model_name = "Cosmos-1.0-Tokenizer-DV8x16x16"

    # Random stand-in for a video batch: (batch, channels, frames, height, width).
    input_tensor = torch.randn(1, 3, 9, 512, 512).to('cuda').to(torch.bfloat16)

    # Encode the clip into discrete token indices (and their FSQ codes).
    encoder = CausalVideoTokenizer(checkpoint_enc=f'pretrained_ckpts/{model_name}/encoder.jit')
    indices, codes = encoder.encode(input_tensor)

    # Decode the indices back into a video tensor with the input's shape.
    decoder = CausalVideoTokenizer(checkpoint_dec=f'pretrained_ckpts/{model_name}/decoder.jit')
    reconstructed_tensor = decoder.decode(indices)
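
    # Sanity check (assuming the 8x16x16 factors): 9 frames map to
    # 1 + (9 - 1) // 8 = 2 latent frames and 512 / 16 = 32 tokens per side.
    assert indices.shape == (1, 2, 32, 32)
    assert reconstructed_tensor.shape == input_tensor.shape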
    

Suggested Cloud GPUs: The model runs in BF16 precision on NVIDIA Ampere (e.g., A100) and NVIDIA Hopper (e.g., H100) GPUs, which are recommended for optimal performance.

License

Cosmos Tokenizer is released under the NVIDIA Open Model License. The license permits commercial use and the creation of derivative models, and NVIDIA claims no ownership of generated outputs. Bypassing any of its technical limitations results in automatic termination of the license. For custom licensing, contact cosmos-license@nvidia.com.
