Cosmos-1.0-Tokenizer-CV8x8x8
Introduction
The Cosmos Tokenizer by NVIDIA is a suite of visual tokenizers that compress images and videos while maintaining high reconstruction quality. It serves as a core component of diffusion-based and autoregressive models for visual content generation, is available for commercial use, and offers substantially higher compression than existing methods.
Architecture
Cosmos Tokenizer features a lightweight, computationally efficient architecture built from causal temporal convolution and attention layers. The causal design preserves the temporal ordering of video frames, allowing a single model to tokenize both images and videos seamlessly. The architecture uses a symmetrical encoder-decoder pair, enhanced by a 2-level Haar wavelet transform for down-sampling. Continuous tokenizers encode inputs into a continuous latent space with an autoencoder, while discrete tokenizers quantize the latent using Finite-Scalar-Quantization (FSQ).
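For intuition, the FSQ idea can be sketched in a few lines: each latent channel is squashed into a bounded range and rounded to a small fixed grid, with a straight-through estimator so gradients still flow during training. This is a minimal illustration of the technique, not the Cosmos implementation, and the level configuration below is an arbitrary example.

import torch

def fsq_quantize(z, levels=(5, 5, 5, 5, 5, 5)):
    # z: (..., len(levels)) latent vectors; each channel is quantized independently.
    # Odd level counts keep the grid symmetric; even levels need a small offset in full FSQ.
    levels_t = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (levels_t - 1) / 2
    bounded = torch.tanh(z) * half        # squash channel i into [-half_i, half_i]
    quantized = torch.round(bounded)      # snap to one of levels[i] integer grid points
    # Straight-through estimator: quantized values forward, smooth gradients backward.
    return bounded + (quantized - bounded).detach()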
Training
The Cosmos Tokenizer models, including both continuous and discrete types, are trained using NVIDIA's advanced GPU hardware. The models achieve high compression ratios and maintain high-quality outputs, evaluated using metrics like PSNR and SSIM. The training framework and pre-trained models can be accessed and utilized for further model development or integration.
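As an example of how reconstruction quality could be measured, PSNR between an input and its reconstruction can be computed as below; this is a generic metric implementation, not NVIDIA's evaluation harness.

import torch

def psnr(original, reconstructed, max_val=1.0):
    # Peak signal-to-noise ratio in dB for tensors scaled to [0, max_val].
    mse = torch.mean((original.float() - reconstructed.float()) ** 2)
    return 10.0 * torch.log10(torch.tensor(max_val ** 2) / mse)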
Guide: Running Locally
Basic Steps
Installation
- Clone the Cosmos-Tokenizer repository from GitHub:
git clone https://github.com/NVIDIA/Cosmos-Tokenizer.git
cd Cosmos-Tokenizer
- Install necessary dependencies:
pip3 install -r requirements.txt
apt-get install -y ffmpeg
- Optionally, build a Docker image:
docker build -t cosmos-docker -f Dockerfile .
docker run --gpus all -it --rm -v /home/${USER}:/home/${USER} --workdir ${PWD} cosmos-docker /bin/bash
Download Pre-Trained Checkpoints
- Create a local directory and download the pre-trained checkpoints. Use these checkpoints for both encoder and decoder JIT models.
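One way to fetch them is via the Hugging Face Hub; the sketch below assumes the checkpoints are published in a repository named after the model under the nvidia organization, so adjust repo_id if the actual repository differs.

from huggingface_hub import snapshot_download

model_name = "Cosmos-Tokenizer-1.0-CV8x8x8"
# Assumed repo id; replace with the actual Hugging Face repository hosting the checkpoints.
snapshot_download(repo_id=f"nvidia/{model_name}", local_dir=f"pretrained_ckpts/{model_name}")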
Run Inference
- Execute the following code to encode and decode images or videos:
import torch
from cosmos_tokenizer.video_lib import CausalVideoTokenizer

model_name = "Cosmos-Tokenizer-1.0-CV8x8x8"
# Random video batch: (batch, channels, frames, height, width) in BF16 on the GPU.
input_tensor = torch.randn(1, 3, 9, 512, 512).to('cuda').to(torch.bfloat16)

# Encode into the continuous latent space.
encoder = CausalVideoTokenizer(checkpoint_enc=f'pretrained_ckpts/{model_name}/encoder.jit')
(latent,) = encoder.encode(input_tensor)

# Decode the latent back into a video tensor.
decoder = CausalVideoTokenizer(checkpoint_dec=f'pretrained_ckpts/{model_name}/decoder.jit')
reconstructed_tensor = decoder.decode(latent)
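With 8x temporal and 8x8 spatial compression, the 9-frame 512x512 input above should yield a latent with 2 temporal positions ((9 - 1)/8 + 1, reflecting the causal design) and 64x64 spatial positions (512/8), while the reconstructed tensor matches the input shape.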
Cloud GPUs
For optimal performance, use NVIDIA Ampere or Hopper GPUs, such as the A100 or H100, which support BF16 precision.
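Before running in BF16, the environment can be checked with standard PyTorch calls, for example:

import torch

# Confirm a CUDA GPU is present and that it supports BF16 compute.
assert torch.cuda.is_available(), "A CUDA-capable GPU is required."
print(torch.cuda.get_device_name(0), "| BF16 supported:", torch.cuda.is_bf16_supported())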
License
The Cosmos Tokenizer is released under the NVIDIA Open Model License. It allows commercial use, distribution of derivative models, and does not claim ownership of outputs generated using the models. Users must comply with the license terms, especially concerning safety and technical limitations.