Cosmos 0.1 Tokenizer DV4x8x8
Introduction
Cosmos Tokenizer is a suite of visual tokenizers developed by NVIDIA for images and videos. It offers various compression rates while maintaining high reconstruction quality, making it suitable for both diffusion-based and autoregressive models used in image and video generation.
Architecture
Cosmos Tokenizer employs a lightweight, computationally efficient architecture. It includes causal temporal convolution and attention layers to maintain the temporal order of video frames. The encoder and decoder are symmetrical, using a 2-level Haar wavelet transform for down-sampling and inverse transformation for reconstruction. Continuous tokenizers utilize a vanilla autoencoder, while discrete tokenizers use Finite-Scalar-Quantization for latent space quantization.
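To illustrate the discrete path, here is a minimal sketch of Finite-Scalar-Quantization in plain Python: each latent channel is bounded with tanh, then rounded to one of a fixed number of uniform levels. The level counts and inputs below are illustrative assumptions, not the values used inside Cosmos Tokenizer.

```python
import math

def fsq_quantize(z, levels):
    """Finite-Scalar-Quantization (sketch): bound each latent channel with
    tanh, then round it to the nearest of a fixed number of uniform levels."""
    out = []
    for value, num_levels in zip(z, levels):
        half = (num_levels - 1) / 2
        bounded = math.tanh(value) * half  # channel now lies in (-half, half)
        out.append(round(bounded))         # nearest of num_levels integer codes
    return out

# A 6-channel latent vector quantized with illustrative levels [8, 8, 8, 5, 5, 5]
codes = fsq_quantize([0.3, -1.2, 2.0, 0.0, -0.5, 1.5], [8, 8, 8, 5, 5, 5])
```

Because the quantizer is just a bounded round per channel, it needs no learned codebook: the product of the per-channel level counts implicitly defines the vocabulary size.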
Training
The Cosmos Tokenizer suite includes continuous and discrete tokenizers for both images and videos, each available at several spatial and temporal compression factors. These tokenizers achieve significantly higher compression rates than prior state-of-the-art methods while delivering better reconstruction quality and faster processing.
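To make the compression factors concrete, the sketch below (a hypothetical helper, not part of the released code) computes the latent grid a causal video tokenizer would produce for a given input. The `1 + (frames - 1) // t` temporal rule is an assumption based on the causal design, under which the first frame is tokenized on its own.

```python
def latent_shape(frames, height, width, t, s):
    """Latent grid for a causal video tokenizer with temporal compression t
    and spatial compression s (hypothetical helper for illustration)."""
    # Causal tokenizers keep the first frame separate, so the temporal
    # length is 1 + (frames - 1) // t rather than frames // t.
    return (1 + (frames - 1) // t, height // s, width // s)

# DV4x8x8: 4x temporal and 8x8 spatial compression of a 9-frame 512x512 clip
print(latent_shape(9, 512, 512, t=4, s=8))  # -> (3, 64, 64)
```

Under this rule, a name like DV8x16x16 denotes 8x temporal and 16x16 spatial compression, so each latent token summarizes a large spatio-temporal patch of the input video.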
Guide: Running Locally
Step 1: Installation
- Clone the Cosmos-Tokenizer repository:
  git clone https://github.com/NVIDIA/Cosmos-Tokenizer.git
  cd Cosmos-Tokenizer
- Install dependencies:
  pip3 install -r requirements.txt
  apt-get install -y ffmpeg
- Build a Docker image (optional):
  docker build -t cosmos-docker -f Dockerfile .
  docker run --gpus all -it --rm -v /home/${USER}:/home/${USER} --workdir ${PWD} cosmos-docker /bin/bash
Step 2: Download Pre-Trained Checkpoints
- Use Hugging Face to download pre-trained model checkpoints:
from huggingface_hub import login, snapshot_download
import os

login(token=<YOUR-HF-TOKEN>, add_to_git_credential=True)

model_names = [
    "Cosmos-Tokenizer-CI8x8",
    "Cosmos-Tokenizer-CI16x16",
    "Cosmos-Tokenizer-CV4x8x8",
    "Cosmos-Tokenizer-CV8x8x8",
    "Cosmos-Tokenizer-CV8x16x16",
    "Cosmos-Tokenizer-DI8x8",
    "Cosmos-Tokenizer-DI16x16",
    "Cosmos-Tokenizer-DV4x8x8",
    "Cosmos-Tokenizer-DV8x8x8",
    "Cosmos-Tokenizer-DV8x16x16",
]
for model_name in model_names:
    hf_repo = "nvidia/" + model_name
    local_dir = "pretrained_ckpts/" + model_name
    os.makedirs(local_dir, exist_ok=True)
    snapshot_download(repo_id=hf_repo, local_dir=local_dir)
Step 3: Run Inference
- Encode and decode images or videos:
import torch
from cosmos_tokenizer.video_lib import CausalVideoTokenizer

model_name = "Cosmos-Tokenizer-DV4x8x8"
input_tensor = torch.randn(1, 3, 9, 512, 512).to('cuda').to(torch.bfloat16)

encoder = CausalVideoTokenizer(checkpoint_enc=f'pretrained_ckpts/{model_name}/encoder.jit')
(indices, codes) = encoder.encode(input_tensor)

decoder = CausalVideoTokenizer(checkpoint_dec=f'pretrained_ckpts/{model_name}/decoder.jit')
reconstructed_tensor = decoder.decode(indices)
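The discrete indices returned by the encoder form a 3-D grid that can be flattened into a 1-D token sequence for an autoregressive model and restored before decoding. The sketch below assumes the (1, 3, 64, 64) index shape a DV4x8x8 tokenizer would produce for the input above, and a codebook size of 64000 (an assumption here, stated only for the random placeholder data).

```python
import torch

# Placeholder indices standing in for encoder output: shape (batch, latent
# frames, latent height, latent width); 64000 is an assumed codebook size.
indices = torch.randint(0, 64000, (1, 3, 64, 64))

tokens = indices.flatten(start_dim=1)   # (1, 3*64*64) token sequence
restored = tokens.view(1, 3, 64, 64)    # invert the flattening before decoding
assert torch.equal(restored, indices)
```

The flatten/view pair is lossless, so the restored grid can be passed directly to the decoder in place of the original indices.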
For optimal performance, it's recommended to use NVIDIA's Ampere or Hopper GPUs, such as the A100 or H100.
License
The Cosmos Tokenizer is released under the NVIDIA Open Model License. This license allows for commercial use, creation and distribution of derivative models, and does not claim ownership of outputs generated using the models. More details can be found in the NVIDIA Open Model License Agreement.