Cosmos-1.0-Diffusion-7B-Text2World
nvidia
Introduction
Cosmos-1.0-Diffusion-7B-Text2World is a diffusion-based model developed by NVIDIA that generates physics-aware videos from text prompts. It is part of the Cosmos World Foundation Models, a family of pre-trained models for world generation whose diffusion variants accept text, image, or video inputs. The models are available for commercial use under the NVIDIA Open Model License.
Architecture
Cosmos-1.0-Diffusion-7B-Text2World uses a diffusion transformer architecture that denoises video in the latent space. The network consists of interleaved self-attention, cross-attention, and feedforward layers. The cross-attention layers condition the denoising process on the input text, and adaptive layer normalization embeds the time information. The architecture also allows latent frames from an input image or video to be concatenated with the generated frames.
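To make the role of adaptive layer normalization more concrete, here is a minimal pure-Python sketch of one common formulation, in which normalized features are modulated by a scale and shift that depend on the denoising timestep. This is an illustration under assumptions, not the Cosmos implementation: the function names, the `1 + scale` form, and the example values are all hypothetical, and in a real model the scale and shift would be produced by a small network applied to the timestep embedding.

```python
import math
from typing import List, Sequence


def layer_norm(x: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize one feature vector to zero mean and (near) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in x]


def adaptive_layer_norm(
    x: Sequence[float],
    scale: Sequence[float],
    shift: Sequence[float],
    eps: float = 1e-6,
) -> List[float]:
    """Apply timestep-dependent scale and shift to the normalized features.

    Assumption: `scale` and `shift` stand in for values a model would
    compute from the timestep embedding; here they are passed in directly.
    """
    normed = layer_norm(x, eps)
    return [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]


# Hypothetical features for one latent token, modulated for two timesteps.
token = [1.0, 2.0, 3.0, 4.0]
early_step = adaptive_layer_norm(token, scale=[0.0] * 4, shift=[0.0] * 4)
late_step = adaptive_layer_norm(token, scale=[0.5] * 4, shift=[0.1] * 4)
print(early_step)
print(late_step)
```

With zero scale and shift the function reduces to plain layer normalization, so the same features are transformed differently only when the timestep-derived values change.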
Training
The model is trained to generate 5-second videos at a resolution of 1280x704 pixels and 24 frames per second. It accepts text descriptions of fewer than 300 words. Performance and evaluation details are given in NVIDIA's technical paper on the Cosmos platform.
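The figures above imply some simple arithmetic that is useful when budgeting storage or validating prompts. The sketch below computes the nominal frame count and pixel volume of one clip and checks a prompt against the 300-word limit. The helper names are hypothetical, whitespace splitting is an assumed way of counting words, and the nominal frame count is just duration times frame rate; the number of frames the model actually emits may differ slightly.

```python
DURATION_S = 5            # clip length in seconds
FPS = 24                  # frames per second
WIDTH, HEIGHT = 1280, 704  # output resolution in pixels
MAX_PROMPT_WORDS = 300    # prompts should contain fewer words than this


def nominal_frame_count(duration_s: int, fps: int) -> int:
    """Frames implied by duration and frame rate alone."""
    return duration_s * fps


def prompt_is_within_limit(prompt: str, max_words: int = MAX_PROMPT_WORDS) -> bool:
    """Return True if the prompt has fewer than `max_words` whitespace-separated words."""
    return len(prompt.split()) < max_words


frames = nominal_frame_count(DURATION_S, FPS)
pixels_per_frame = WIDTH * HEIGHT
pixels_per_clip = frames * pixels_per_frame

print(f"nominal frames per clip: {frames}")
print(f"pixels per frame: {pixels_per_frame}")
print(f"pixels per clip: {pixels_per_clip}")
print(prompt_is_within_limit("A robot arm stacks wooden blocks on a table."))
```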
Guide: Running Locally
- Set Up Environment: Install the necessary dependencies, including the transformers library.
- Download Model: Access the model from Hugging Face's repository.
- Hardware Requirements: For optimal performance, use NVIDIA GPUs based on the Blackwell, Hopper, or Ampere architecture. GPU memory management strategies are available for systems with limited memory.
- Run Inference: Load the model and provide text input to generate videos. Adjust frame rate and resolution as needed.
- Cloud Options: Consider using cloud GPUs, such as NVIDIA's H100, for faster inference times.
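As a rough illustration of the download step, the sketch below fetches the checkpoint files with the `huggingface_hub` package's `snapshot_download` function. Several details are assumptions rather than facts from this guide: the repository ID is inferred from the model name and the nvidia organization, the local directory is a hypothetical choice, and `huggingface_hub` must be installed separately. The repository may also require accepting the license or logging in before files can be downloaded. Running inference itself is not shown, since the exact loading code depends on the tooling NVIDIA provides.

```python
# Assumed repository ID, inferred from the model name and organization.
REPO_ID = "nvidia/Cosmos-1.0-Diffusion-7B-Text2World"

# Hypothetical target directory for the downloaded files.
DEFAULT_LOCAL_DIR = "checkpoints/Cosmos-1.0-Diffusion-7B-Text2World"


def download_checkpoint(local_dir: str = DEFAULT_LOCAL_DIR) -> str:
    """Download all files of the model repository and return the local path.

    The import is done lazily so this module can be loaded on machines
    where huggingface_hub is not installed.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
```

Calling `download_checkpoint()` transfers the full set of model files, so it is best run on the machine (local or cloud) that will perform inference.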
License
The Cosmos-1.0-Diffusion-7B-Text2World model is distributed under the NVIDIA Open Model License. The license permits commercial use and the creation of derivative models, and NVIDIA does not claim ownership of outputs generated with the model. Users must adhere to the license terms, including restrictions against bypassing safety mechanisms, and must comply with NVIDIA's Trustworthy AI terms.