Cosmos 1.0 Tokenizer CV8x8x8


Introduction

The Cosmos Tokenizer by NVIDIA is a suite of visual tokenizers that compress images and videos while maintaining high reconstruction quality. Tokenizers are a core building block of both diffusion-based and autoregressive models for visual generation. This variant, CV8x8x8, is a continuous video tokenizer with 8x temporal and 8x8 spatial compression. The suite is available for commercial use and delivers substantially higher compression than prior tokenizers at comparable or better reconstruction quality.

Architecture

Cosmos Tokenizer uses a lightweight, computationally efficient architecture built from causal temporal convolution and causal temporal attention layers. The causal design preserves the temporal ordering of video frames and lets the same model tokenize both single images and videos. The encoder and decoder form a symmetrical pair, with a 2-level Haar wavelet transform used for down-sampling. Continuous tokenizers produce a continuous latent space with an autoencoder formulation, while discrete tokenizers quantize the latent with Finite-Scalar-Quantization (FSQ).
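
To make the causal-convolution idea concrete, the sketch below implements a toy causal temporal 3D convolution in PyTorch. The class name, channel count, and kernel sizes are illustrative assumptions, not NVIDIA's actual layers; the point is only that padding the time axis on the past side keeps frame t from seeing future frames, which is what lets one model handle both single images and videos.

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class CausalTemporalConv3d(nn.Module):
      """Toy causal temporal convolution: output at frame t sees only frames <= t."""

      def __init__(self, channels: int, kernel_t: int = 3, kernel_s: int = 3):
          super().__init__()
          self.kernel_t = kernel_t
          self.conv = nn.Conv3d(
              channels, channels,
              kernel_size=(kernel_t, kernel_s, kernel_s),
              padding=(0, kernel_s // 2, kernel_s // 2),  # pad space symmetrically, time manually
          )

      def forward(self, x: torch.Tensor) -> torch.Tensor:
          # x has shape (batch, channels, frames, height, width).
          # Padding only the "past" side of the time axis makes the layer causal,
          # so a single image (frames=1) and a video pass through the same weights.
          x = F.pad(x, (0, 0, 0, 0, self.kernel_t - 1, 0))
          return self.conv(x)

  layer = CausalTemporalConv3d(channels=16)
  video = torch.randn(1, 16, 9, 64, 64)   # 9-frame clip
  image = torch.randn(1, 16, 1, 64, 64)   # single image treated as a 1-frame video
  print(layer(video).shape, layer(image).shape)  # both keep their frame counts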

Training

The Cosmos Tokenizer models, both continuous and discrete, are trained on NVIDIA GPUs. They reach high compression ratios while preserving reconstruction quality, as measured by metrics such as PSNR and SSIM. The training framework and pre-trained checkpoints are available and can be used for further model development or integration.
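
As a rough illustration of how reconstruction quality can be checked with such metrics, the snippet below computes PSNR between an original and a reconstructed clip; the exact evaluation protocol behind the published numbers may differ.

  import torch

  def psnr(original: torch.Tensor, reconstructed: torch.Tensor, max_val: float = 1.0) -> float:
      """Peak signal-to-noise ratio (dB) for tensors scaled to [0, max_val]."""
      mse = torch.mean((original.float() - reconstructed.float()) ** 2)
      return float(10.0 * torch.log10(max_val ** 2 / mse))

  # Fake "reconstruction": the original clip plus mild noise.
  video = torch.rand(1, 3, 9, 512, 512)
  recon = (video + 0.01 * torch.randn_like(video)).clamp(0, 1)
  print(f"PSNR: {psnr(video, recon):.2f} dB")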

Guide: Running Locally

Basic Steps

  1. Installation

    • Clone the Cosmos-Tokenizer repository from GitHub:
      git clone https://github.com/NVIDIA/Cosmos-Tokenizer.git
      cd Cosmos-Tokenizer
      
    • Install necessary dependencies:
      pip3 install -r requirements.txt
      apt-get install -y ffmpeg
      
    • Optionally, build a Docker image:
      docker build -t cosmos-docker -f Dockerfile .
      docker run --gpus all -it --rm -v /home/${USER}:/home/${USER} --workdir ${PWD} cosmos-docker /bin/bash
      
  2. Download Pre-Trained Checkpoints

    • Create a local directory (for example pretrained_ckpts/) and download the pre-trained checkpoints; each tokenizer ships as separate encoder and decoder JIT models (see the hedged download sketch after this list).
  3. Run Inference

    • Execute the following code to encode and decode images or videos:
      import torch
      from cosmos_tokenizer.video_lib import CausalVideoTokenizer
      
      # Checkpoint directory name under pretrained_ckpts/
      model_name = "Cosmos-1.0-Tokenizer-CV8x8x8"

      # Random test clip: (batch, channels, frames, height, width) in bfloat16 on the GPU;
      # 9 frames (1 + 8) matches the 8x temporal compression of this model.
      input_tensor = torch.randn(1, 3, 9, 512, 512).to('cuda').to(torch.bfloat16)

      # Encode to a continuous latent with the JIT-compiled encoder.
      encoder = CausalVideoTokenizer(checkpoint_enc=f'pretrained_ckpts/{model_name}/encoder.jit')
      (latent,) = encoder.encode(input_tensor)

      # Decode the latent back into a video tensor with the JIT-compiled decoder.
      decoder = CausalVideoTokenizer(checkpoint_dec=f'pretrained_ckpts/{model_name}/decoder.jit')
      reconstructed_tensor = decoder.decode(latent)
      

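For step 2 above, the repository ships its own download script; as an alternative, a minimal sketch using huggingface_hub is shown here. It assumes the checkpoints are published under the Hugging Face repo id nvidia/Cosmos-1.0-Tokenizer-CV8x8x8 (adjust the id if it differs) and places them in the pretrained_ckpts/ layout that the inference snippet expects.

  from huggingface_hub import snapshot_download

  # Assumed Hugging Face repo id for this tokenizer; adjust if the published id differs.
  model_name = "Cosmos-1.0-Tokenizer-CV8x8x8"
  snapshot_download(
      repo_id=f"nvidia/{model_name}",
      local_dir=f"pretrained_ckpts/{model_name}",  # layout expected by the inference snippet above
  )
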
Cloud GPUs

For optimal performance, use NVIDIA Ampere or Hopper GPUs, such as the A100 or H100, which support BF16 precision.
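
As a quick sanity check before running the bfloat16 example above, the following snippet (standard PyTorch calls, not part of the Cosmos repository) reports whether the active GPU supports BF16.

  import torch

  # Report the active CUDA device and whether it supports bfloat16 math.
  if torch.cuda.is_available():
      print(torch.cuda.get_device_name(0))
      print("BF16 supported:", torch.cuda.is_bf16_supported())
  else:
      print("No CUDA device found; the JIT models above expect a GPU.")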

License

The Cosmos Tokenizer is released under the NVIDIA Open Model License. It allows commercial use, distribution of derivative models, and does not claim ownership of outputs generated using the models. Users must comply with the license terms, especially concerning safety and technical limitations.
