Cosmos-0.1-Tokenizer-CV8x8x8
Introduction
Cosmos Tokenizer is a suite of visual tokenizers developed by NVIDIA for images and videos, designed to provide a range of compression rates while maintaining high reconstruction quality. It serves as a building block in diffusion-based and autoregressive models for image and video generation, and is available in both continuous (latent) and discrete (token) variants.
Architecture
Cosmos Tokenizer employs a lightweight, efficient architecture with a temporally causal design: causal temporal convolution and causal temporal attention layers preserve the temporal order of video frames. The symmetrical encoder-decoder pair uses a 2-level Haar wavelet transform for down-sampling and its inverse for up-sampling. Continuous tokenizers model the latent space with autoencoders, while discrete tokenizers quantize it with Finite-Scalar-Quantization (FSQ).
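To make the FSQ step concrete, here is a minimal generic sketch of finite scalar quantization, not NVIDIA's implementation: each latent channel is bounded, scaled to a small integer grid, and rounded, with a straight-through estimator so gradients flow during training. The function name, the per-channel `levels` list, and the restriction to odd level counts (which avoids the half-step offset even counts need) are all assumptions of this sketch.

```python
import torch

def fsq_quantize(z: torch.Tensor, levels: list[int]) -> torch.Tensor:
    """Quantize the last dim of z; levels[i] = number of values for channel i.

    Sketch only: assumes odd level counts, one entry per latent channel.
    """
    half = (torch.tensor(levels, dtype=z.dtype) - 1) / 2
    # bound each channel to (-1, 1), then scale onto the integer grid
    z_bounded = torch.tanh(z) * half
    # round to the nearest level; the detach() implements the
    # straight-through estimator (identity gradient through rounding)
    z_q = z_bounded + (torch.round(z_bounded) - z_bounded).detach()
    return z_q / half  # rescale back to [-1, 1]
```

Because each channel can only take `levels[i]` distinct values, the quantized latents form a finite codebook without needing the learned-codebook lookup of VQ-VAE-style quantizers.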
Training
The model family supports spatial compression rates of 8x8 or 16x16 and temporal compression factors of 4x or 8x, for up to 2048x overall compression. It is optimized for NVIDIA Ampere and Hopper GPUs using BF16 precision, and runs on Linux. The models are evaluated against standard datasets and a custom benchmark, TokenBench, and outperform prior state-of-the-art tokenizers.
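The quoted rates multiply out as temporal x height x width downsampling (counting spatiotemporal positions, not channels). A quick arithmetic check, assuming that reading of the naming convention:

```python
def compression_factor(temporal: int, spatial: int) -> int:
    """Overall reduction in spatiotemporal positions for a TxSxS tokenizer."""
    return temporal * spatial * spatial

# 8x temporal with 16x16 spatial gives the maximum quoted 2048x rate
assert compression_factor(8, 16) == 2048
# e.g. a 4x8x8 video tokenizer compresses positions by 256x
assert compression_factor(4, 8) == 256
```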
Guide: Running Locally
Step 1: Installation of Cosmos-Tokenizer
- Clone the Cosmos-Tokenizer repository.
```shell
git clone https://github.com/NVIDIA/Cosmos-Tokenizer.git
cd Cosmos-Tokenizer
```
- Install dependencies.
```shell
pip3 install -r requirements.txt
apt-get install -y ffmpeg
```
Step 2: Download Pre-Trained Checkpoints
- Use Hugging Face to download the pre-trained checkpoints.
```python
from huggingface_hub import login, snapshot_download
import os

login(token=<YOUR-HF-TOKEN>, add_to_git_credential=True)
model_names = [
    "Cosmos-Tokenizer-CI8x8",
    "Cosmos-Tokenizer-CI16x16",
    "Cosmos-Tokenizer-CV4x8x8",
    ...,
]
for model_name in model_names:
    hf_repo = "nvidia/" + model_name
    local_dir = "pretrained_ckpts/" + model_name
    os.makedirs(local_dir, exist_ok=True)
    snapshot_download(repo_id=hf_repo, local_dir=local_dir)
```
Step 3: Run Inference
- Encode and decode images or videos using the model.
```python
import torch
from cosmos_tokenizer.video_lib import CausalVideoTokenizer

model_name = "Cosmos-Tokenizer-CV4x8x8"
# dummy input video: (batch, channels, frames, height, width)
input_tensor = torch.randn(1, 3, 9, 512, 512).to('cuda').to(torch.bfloat16)

encoder = CausalVideoTokenizer(checkpoint_enc=f'pretrained_ckpts/{model_name}/encoder.jit')
latent = encoder.encode(input_tensor)

decoder = CausalVideoTokenizer(checkpoint_dec=f'pretrained_ckpts/{model_name}/decoder.jit')
reconstructed_tensor = decoder.decode(latent)
```
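With a causal video tokenizer the first frame is encoded on its own and each subsequent group of frames maps to one latent frame, so an input of 1 + n*T frames yields 1 + n latent frames. A quick sanity check of the latent grid expected for the 9-frame, 512x512 example above, assuming this causal grouping (the latent channel count depends on the checkpoint and is omitted here):

```python
def latent_grid(frames: int, temporal: int, spatial: int, size: int):
    """Expected (T, H, W) of the latent for a causal TxSxS video tokenizer.

    Sketch under the assumption that the first frame is encoded alone and
    the remaining frames are grouped in blocks of `temporal`.
    """
    t = 1 + (frames - 1) // temporal
    hw = size // spatial
    return t, hw, hw

# Cosmos-Tokenizer-CV4x8x8 on 9 frames of 512x512 video
assert latent_grid(9, 4, 8, 512) == (3, 64, 64)
```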
Cloud GPUs: For enhanced performance, consider using NVIDIA A100 or H100 GPUs available on cloud platforms.
License
The Cosmos Tokenizer is released under the NVIDIA Open Model License, which allows commercial use and distribution of derivative models. NVIDIA does not claim ownership of any outputs generated using these models. The full license text is available on the model page.