Cosmos-1.0-Tokenizer-DV8x16x16
Introduction
The Cosmos Tokenizer is a suite of visual tokenizers designed by NVIDIA for compressing images and videos. It provides high reconstruction quality and is suitable for both diffusion-based and autoregressive models. The tool can be used commercially and supports various compression rates.
Architecture
The Cosmos Tokenizer features a lightweight, temporally causal architecture built from causal temporal convolution and causal temporal attention layers. The encoder and decoder form a symmetrical pair, using a 2-level Haar wavelet transform to down-sample inputs. Continuous tokenizers model the latent space with an autoencoder, while discrete tokenizers quantize latents with Finite-Scalar Quantization (FSQ).
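The FSQ step can be sketched in a few lines: each latent channel is squashed into a bounded range and rounded to a small, fixed set of levels, so no learned codebook is needed. The level counts and helper names below are illustrative, not the actual Cosmos codebook configuration:

```python
import math

def fsq_quantize(z, levels):
    """FSQ sketch: squash each latent channel into a bounded range with tanh,
    then round it to one of `levels[i]` uniform values in [-1, 1]."""
    q = []
    for zi, num_levels in zip(z, levels):
        half = (num_levels - 1) / 2
        bounded = math.tanh(zi) * half   # value in (-half, half)
        q.append(round(bounded) / half)  # snap to grid, rescale to [-1, 1]
    return q

def fsq_index(q, levels):
    """Map a quantized vector to a single integer token id (mixed-radix).
    The implied vocabulary size is the product of the level counts."""
    idx = 0
    for qi, num_levels in zip(q, levels):
        digit = int(round(qi * (num_levels - 1) / 2 + (num_levels - 1) / 2))
        idx = idx * num_levels + digit
    return idx

# Three channels with 5 levels each -> 125 possible tokens (illustrative only).
print(fsq_quantize([0.0, 10.0, -10.0], [5, 5, 5]))  # [0.0, 1.0, -1.0]
print(fsq_index([0.0, 1.0, -1.0], [5, 5, 5]))       # 70
```

Because the grid is fixed, every quantized vector maps deterministically to a token id, which is what makes FSQ attractive for autoregressive modeling.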
Training
Cosmos Tokenizer comes with pre-trained models that offer different types of tokenizers for both continuous and discrete data. The models are designed to compress visual data efficiently, achieving up to 2048x total compression factors, outperforming state-of-the-art methods in speed and quality.
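The model name itself encodes the compression factors: DV8x16x16 compresses 8x in time and 16x16 spatially, and 8 * 16 * 16 = 2048 is the quoted total factor. A quick sanity check of the resulting token grid, as a sketch under the causal-design assumption that a clip of 1 + n*8 frames maps to 1 + n latent frames (`token_grid` is a hypothetical helper, not the library's API):

```python
def token_grid(frames, height, width, ct=8, cs=16):
    """Latent token grid for a causal video tokenizer with temporal factor
    `ct` and spatial factor `cs`. The first frame is encoded on its own,
    so 1 + n*ct input frames yield 1 + n latent frames."""
    t = 1 + (frames - 1) // ct
    return t, height // cs, width // cs

t, h, w = token_grid(9, 512, 512)  # a 9-frame 512x512 clip
tokens = t * h * w                 # 2 * 32 * 32 = 2048 discrete tokens
# Spatial compression alone is 16 * 16 = 256 per frame; combined with the
# 8x temporal factor this gives the nominal 8 * 16 * 16 = 2048x total.
```

The same arithmetic applies to the other Cosmos compression rates by swapping the `ct` and `cs` defaults.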
Guide: Running Locally
- Installation: Clone the Cosmos-Tokenizer repository from GitHub and install the necessary dependencies.

```shell
git clone https://github.com/NVIDIA/Cosmos-Tokenizer.git
cd Cosmos-Tokenizer
pip3 install -r requirements.txt
apt-get install -y ffmpeg
```
Optionally, build and run a Docker container:

```shell
docker build -t cosmos-docker -f Dockerfile .
docker run --gpus all -it --rm -v /home/${USER}:/home/${USER} --workdir ${PWD} cosmos-docker /bin/bash
```
- Pre-trained Checkpoints: Download the pre-trained models from Hugging Face.

```python
import os

from huggingface_hub import login, snapshot_download

login(token="<YOUR-HF-TOKEN>", add_to_git_credential=True)

model_names = ["Cosmos-1.0-Tokenizer-DV8x16x16"]
for model_name in model_names:
    hf_repo = "nvidia/" + model_name
    local_dir = "pretrained_ckpts/" + model_name
    os.makedirs(local_dir, exist_ok=True)
    snapshot_download(repo_id=hf_repo, local_dir=local_dir)
```
- Inference: Run the tokenizer for encoding and decoding.

```python
import torch

from cosmos_tokenizer.video_lib import CausalVideoTokenizer

model_name = "Cosmos-1.0-Tokenizer-DV8x16x16"

# Random test clip: (batch, channels, frames, height, width), BF16 on GPU.
input_tensor = torch.randn(1, 3, 9, 512, 512).to('cuda').to(torch.bfloat16)

# Encode to discrete token indices (and their quantized codes).
encoder = CausalVideoTokenizer(checkpoint_enc=f'pretrained_ckpts/{model_name}/encoder.jit')
indices, codes = encoder.encode(input_tensor)

# Decode the indices back to a reconstructed video tensor.
decoder = CausalVideoTokenizer(checkpoint_dec=f'pretrained_ckpts/{model_name}/decoder.jit')
reconstructed_tensor = decoder.decode(indices)
```
Suggested Cloud GPUs: The model runs in BF16 precision and is compatible with NVIDIA Ampere (e.g., A100) and NVIDIA Hopper (e.g., H100) GPUs; these are recommended for optimal performance.
License
Cosmos Tokenizer is licensed under the NVIDIA Open Model License. This license allows for commercial use, the creation of derivative models, and does not claim ownership of outputs. Breaching any technical limitations will result in automatic termination of the license. For custom licensing, contact cosmos-license@nvidia.com.