Cosmos-1.0-Diffusion-7B-Text2World

nvidia

Introduction

Cosmos-1.0-Diffusion-7B-Text2World is a diffusion-based model developed by NVIDIA for generating physics-aware videos from text prompts. It is part of the Cosmos World Foundation Models, a family of models pre-trained for world generation applications whose diffusion variants cover text, image, and video inputs. The models are available for commercial use under the NVIDIA Open Model License.

Architecture

Cosmos-1.0-Diffusion-7B-Text2World uses a diffusion transformer architecture that denoises video in the latent space. Each block interleaves self-attention, cross-attention, and feedforward layers. The cross-attention layers condition the denoising process on the input text, and adaptive layer normalization injects the diffusion time step. The same architecture allows latent frames from an input image or video to be concatenated with the generated frames.
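The adaptive layer normalization mentioned above can be illustrated with a minimal pure-Python sketch. It is not NVIDIA's implementation: the function names are hypothetical, and the `(1 + scale)` modulation follows a convention common in diffusion transformers, which is an assumption here. In a real model the scale and shift would be predicted from the time-step embedding by a small learned network; in this sketch they are passed in directly.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaptive_layer_norm(x, scale, shift):
    """Modulate normalized features with a per-feature scale and shift.

    Assumption: scale and shift stand in for values a learned network
    would derive from the diffusion time step.
    """
    normalized = layer_norm(x)
    return [n * (1.0 + s) + b for n, s, b in zip(normalized, scale, shift)]


# With zero scale and shift, the result is plain layer normalization.
plain = adaptive_layer_norm([1.0, 2.0, 3.0], [0.0] * 3, [0.0] * 3)
# A non-zero shift moves every feature by the same offset.
shifted = adaptive_layer_norm([1.0, 2.0, 3.0], [1.0] * 3, [0.5] * 3)
```

Because the modulation depends on the time step rather than on fixed learned parameters, the same block can behave differently at early (noisy) and late (nearly clean) denoising steps.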

Training

The model is trained to generate videos with a 5-second duration and a resolution of 1280x704 pixels at 24 frames per second. It can process input text descriptions containing fewer than 300 words. The model's performance and evaluation details are outlined in NVIDIA's technical paper on the Cosmos platform.
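The figures above translate into a few simple checks. The sketch below computes the nominal frame count from the stated duration and frame rate, and tests a prompt against the 300-word limit. The helper names are hypothetical, counting words by splitting on whitespace is an assumption about how the limit is measured, and the number of frames a checkpoint actually emits may differ slightly from the nominal product.

```python
# Output settings as stated in the text above.
DURATION_S = 5
FPS = 24
WIDTH, HEIGHT = 1280, 704
MAX_PROMPT_WORDS = 300


def nominal_frame_count(duration_s=DURATION_S, fps=FPS):
    """Frames implied by duration times frame rate."""
    return duration_s * fps


def prompt_within_limit(prompt, max_words=MAX_PROMPT_WORDS):
    """True if the whitespace-delimited word count is strictly below the limit."""
    return len(prompt.split()) < max_words


frames = nominal_frame_count()
ok = prompt_within_limit("A robot arm stacks boxes in a sunlit warehouse.")
```

Checking the prompt length before inference avoids spending GPU time on an input the model was not trained to handle.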

Guide: Running Locally

  1. Set Up Environment: Install the necessary dependencies, including the transformers library.
  2. Download Model: Download the model weights from the model's repository on Hugging Face.
  3. Hardware Requirements: For best performance, use NVIDIA GPUs based on the Blackwell, Hopper, or Ampere architectures. The model documentation describes GPU memory management strategies for systems with limited memory.
  4. Run Inference: Load the model and provide a text prompt to generate a video. Adjust frame rate and resolution as needed.
  5. Cloud Options: Consider using cloud GPUs, such as the NVIDIA H100, for faster inference.
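The download step can be sketched with the `huggingface_hub` package and its `snapshot_download` function. This is a sketch under assumptions, not the official procedure: the repository ID is inferred from the model name and the publisher shown in this card, the `checkpoints/` layout is hypothetical, and access may require logging in to Hugging Face and accepting the license first.

```python
from pathlib import Path

# Assumption: repository ID inferred from the publisher and model name.
REPO_ID = "nvidia/Cosmos-1.0-Diffusion-7B-Text2World"


def default_checkpoint_dir(root="checkpoints"):
    """Hypothetical local layout: <root>/<model name>."""
    return Path(root) / REPO_ID.split("/")[-1]


def download_checkpoint(root="checkpoints"):
    """Download all files of the model repository into a local directory."""
    # Imported lazily so the path helper works without the package installed.
    from huggingface_hub import snapshot_download

    target = default_checkpoint_dir(root)
    snapshot_download(repo_id=REPO_ID, local_dir=str(target))
    return target
```

Calling `download_checkpoint()` fetches the full repository, so make sure enough disk space is available before running it.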

License

The Cosmos-1.0-Diffusion-7B-Text2World model is distributed under the NVIDIA Open Model License. The license permits commercial use and the creation of derivative models, and NVIDIA does not claim ownership of outputs generated with the model. Users must adhere to the license terms, including the restriction against bypassing safety mechanisms and the requirement to comply with NVIDIA's Trustworthy AI terms.
