Cosmos-1.0-Diffusion-14B-Text2World
nvidia
Introduction
Cosmos-1.0-Diffusion-14B-Text2World is a diffusion transformer model developed by NVIDIA. It generates dynamic, high-quality videos from text inputs and is intended for applications in world generation and physical AI development. Users can create physics-aware videos from text descriptions, and the model is available for commercial use under the NVIDIA Open Model License.
Architecture
The model is a diffusion transformer that denoises video in latent space. Its blocks combine self-attention, cross-attention, and feedforward layers, with the cross-attention layers conditioning the denoising process on the input text. Adaptive layer normalization embeds the denoising time information. When video input is provided, its conditional latent frames can be concatenated with the generated frames. The architecture also supports augment noise, which is intended to narrow the gap between training and inference.
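To make the role of adaptive layer normalization concrete, the following is a minimal pure-Python sketch of the general technique, not NVIDIA's implementation. It assumes a sinusoidal timestep embedding and a pair of linear projections that predict a per-feature scale and shift; the function names, the embedding choice, and the weight shapes are all illustrative assumptions.

```python
import math

def timestep_embedding(t, dim):
    """Sinusoidal embedding of a scalar timestep (an assumed, commonly used choice)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

def layer_norm(x, eps=1e-6):
    """Normalize a feature vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(x, weight, bias):
    """Dense layer: weight is a list of rows, one row per output feature."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def adaptive_layer_norm(x, t_emb, w_scale, b_scale, w_shift, b_shift):
    """Layer norm whose scale and shift are predicted from the timestep embedding."""
    scale = linear(t_emb, w_scale, b_scale)
    shift = linear(t_emb, w_shift, b_shift)
    normed = layer_norm(x)
    # (1 + scale) keeps the layer close to a plain layer norm when scale is near zero.
    return [n * (1.0 + s) + sh for n, s, sh in zip(normed, scale, shift)]

if __name__ == "__main__":
    dim = 4
    zeros = [[0.0] * dim for _ in range(dim)]
    features = [1.0, 2.0, 3.0, 4.0]          # hypothetical token features
    t_emb = timestep_embedding(0.5, dim)      # hypothetical timestep
    print(adaptive_layer_norm(features, t_emb, zeros, [0.0] * dim, zeros, [0.0] * dim))
```

With all projection weights set to zero, the layer reduces to an ordinary layer norm; in a trained model the projections let each denoising step modulate the features differently.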
Training
The Cosmos diffusion models, including this 14B-parameter variant, rely on large-scale pre-training to generate video content from text descriptions. They are fine-tuned to handle text prompts of fewer than 300 words and to produce 121-frame videos at configurable resolutions and frame rates.
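The input and output limits above can be checked before launching a long generation job. The sketch below is a hypothetical pre-flight helper, not part of the Cosmos codebase: it counts whitespace-separated words (an assumed definition of "word") and derives the clip length from the 121-frame output. The frame rate passed in is whatever the user configures; no default is implied by the source.

```python
MAX_PROMPT_WORDS = 300  # from the text: prompts should be under 300 words
NUM_FRAMES = 121        # from the text: the model produces 121-frame videos

def count_words(prompt: str) -> int:
    """Count whitespace-separated words (an assumed, simple definition)."""
    return len(prompt.split())

def validate_prompt(prompt: str) -> int:
    """Return the word count, or raise ValueError if the prompt is empty or too long."""
    n = count_words(prompt)
    if n == 0:
        raise ValueError("prompt is empty")
    if n >= MAX_PROMPT_WORDS:
        raise ValueError(f"prompt has {n} words; expected fewer than {MAX_PROMPT_WORDS}")
    return n

def clip_duration_seconds(fps: float, num_frames: int = NUM_FRAMES) -> float:
    """Duration of the generated clip at the chosen frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps

if __name__ == "__main__":
    prompt = "a robot arm stacks red cubes"   # hypothetical prompt
    print(validate_prompt(prompt), "words")
    print(round(clip_duration_seconds(24), 2), "seconds at 24 fps")  # 24 fps is an example value
```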
Guide: Running Locally
To run Cosmos-1.0-Diffusion-14B-Text2World locally, follow these steps:
- Setup Environment
  - Ensure you have a compatible operating system, preferably Linux.
  - Install necessary dependencies, including the Cosmos runtime engine available on GitHub.
- Install NVIDIA Hardware & Drivers
  - Use NVIDIA GPUs such as Blackwell, Hopper, or Ampere for optimal performance.
  - Install appropriate CUDA and cuDNN versions.
- Download Models
  - Access the model on Hugging Face and download it along with any required checkpoints.
- Inference
  - Execute the model using the provided scripts to generate videos from text inputs.
  - Consider using cloud GPUs like NVIDIA A100 or H100 for efficient processing, especially if your local setup has limited memory.
- Optimize Memory Usage
  - Utilize model offloading strategies to manage GPU memory effectively, suitable for systems with limited resources.
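As a rough aid for the memory step, the sketch below estimates how much GPU memory the diffusion model's weights alone would occupy and whether offloading is worth considering. It is back-of-the-envelope arithmetic, not a measurement: the 14 billion parameter count is the nominal figure from the model name, the bfloat16 default and the `headroom_gib` allowance for activations and other components are assumptions, and the helper names are hypothetical.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}

NOMINAL_PARAMS = 14_000_000_000  # nominal count implied by "14B" in the model name

def weight_memory_gib(num_params: int, dtype: str = "bfloat16") -> float:
    """Memory needed to hold the weights only, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30

def should_offload(num_params: int, gpu_memory_gib: float,
                   dtype: str = "bfloat16", headroom_gib: float = 8.0) -> bool:
    """True if weights plus an assumed headroom would not fit on the GPU."""
    return weight_memory_gib(num_params, dtype) + headroom_gib > gpu_memory_gib

if __name__ == "__main__":
    print(f"weights: {weight_memory_gib(NOMINAL_PARAMS):.1f} GiB in bfloat16")
    for gpu_gib in (24, 48, 80):  # example GPU memory sizes
        print(gpu_gib, "GiB GPU -> offload suggested:", should_offload(NOMINAL_PARAMS, gpu_gib))
```

Actual usage also depends on the text encoder, the video tokenizer, resolution, and frame count, so treat the result as a lower bound when choosing between a local GPU, offloading, or a cloud instance.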
License
Cosmos-1.0-Diffusion-14B-Text2World is released under the NVIDIA Open Model License. This license permits commercial use, derivative model creation, and distribution. It requires adherence to NVIDIA's AI ethics guidelines and includes terms for redistribution and liability.