Cosmos-1.0-Autoregressive-13B-Video2World
Introduction
Cosmos-1.0-Autoregressive-13B-Video2World is part of NVIDIA's Cosmos suite, a collection of pre-trained world foundation models designed to generate physics-aware videos and world states for physical AI development. The model uses autoregressive prediction to generate video sequences from a text prompt combined with video or image inputs, making it suitable for a range of world-generation applications in physical AI.
Architecture
The Cosmos-1.0-Autoregressive-13B-Video2World model is an autoregressive transformer. It utilizes interleaved self-attention, cross-attention, and feedforward layers. The cross-attention layers allow it to condition on input text during the decoding process. The model supports input types like text, image, and video, and outputs video sequences.
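To make the block structure concrete, here is a minimal sketch of one interleaved decoder block, assuming standard PyTorch components; the class name, dimensions, and pre-norm placement are illustrative choices, not details taken from NVIDIA's implementation.

```python
import torch
import torch.nn as nn

class InterleavedBlock(nn.Module):
    """One decoder block: causal self-attention -> cross-attention on text -> feedforward."""

    def __init__(self, d_model: int = 1024, n_heads: int = 16):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, video_tokens, text_emb, causal_mask):
        # Causal self-attention over previously generated video tokens.
        h = self.norm1(video_tokens)
        video_tokens = video_tokens + self.self_attn(h, h, h, attn_mask=causal_mask)[0]
        # Cross-attention conditions decoding on the text embeddings.
        h = self.norm2(video_tokens)
        video_tokens = video_tokens + self.cross_attn(h, text_emb, text_emb)[0]
        # Position-wise feedforward.
        return video_tokens + self.ffn(self.norm3(video_tokens))

# Illustrative shapes: 16 video tokens, 8 text tokens, width 1024.
block = InterleavedBlock()
video = torch.randn(1, 16, 1024)
text = torch.randn(1, 8, 1024)
mask = nn.Transformer.generate_square_subsequent_mask(16)
out = block(video, text, mask)   # same shape as `video`
```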
Training
The model is trained to predict future video frames from a text description together with initial video or image inputs. It is optimized to generate continuations from a text description combined with either a 9-frame input video or a single input image, predicting 24 or 32 future frames respectively, so either path yields a 33-frame clip in total.
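As a hedged illustration of the rollout this training objective implies, the sketch below samples future video tokens one step at a time and then decodes them back to frames. `model`, `tokenizer`, and the tensor shapes are hypothetical stand-ins, not NVIDIA's actual interfaces.

```python
import torch

@torch.no_grad()
def rollout(model, tokenizer, text_emb, input_frames, num_future_frames):
    # Hypothetical tokenizer: maps conditioning frames (shape (T, C, H, W),
    # with T = 1 image or T = 9 video frames) to discrete tokens of shape (1, N).
    tokens = tokenizer.encode(input_frames)
    tokens_per_frame = tokens.shape[1] // input_frames.shape[0]

    # Sample one token at a time until the requested future frames
    # (24 for video conditioning, 32 for image conditioning) are filled in.
    for _ in range(num_future_frames * tokens_per_frame):
        logits = model(tokens, text_emb)                    # (1, seq_len, vocab_size)
        probs = torch.softmax(logits[:, -1], dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)

    # Decode all tokens (conditioning + generated) back into a 33-frame clip.
    return tokenizer.decode(tokens)
```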
Guide: Running Locally
- Set Up Environment
  - Ensure you have a compatible Linux system with the necessary software dependencies installed, including Python and PyTorch.
- Install Required Libraries
  - Use a package manager such as pip to install the required libraries, including Transformers and any NVIDIA-specific packages.
- Download the Model
  - Access the model from Hugging Face's model repository and download it locally (see the download sketch after this list).
- Run Inference
  - Load the model and input data (a text description plus an image or video) into your runtime environment.
  - Configure the model parameters and start the inference process to generate video outputs (a hedged inference outline also follows this list).
- Hardware Recommendations
  - A powerful GPU is recommended for efficient inference, such as NVIDIA's Blackwell, Hopper, or Ampere series. Consider cloud GPUs for enhanced performance and scalability (a device-check snippet appears below as well).
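For the Download the Model step, one way to fetch the checkpoint is with the huggingface_hub client, as sketched below; the repo ID is assumed to match the model's name on the Hugging Face Hub.

```python
# Fetch the checkpoint files from the Hugging Face Hub.
# If the repository is gated, authenticate first (e.g. huggingface_hub.login()).
from huggingface_hub import snapshot_download

checkpoint_dir = snapshot_download(
    repo_id="nvidia/Cosmos-1.0-Autoregressive-13B-Video2World",
    local_dir="checkpoints/Cosmos-1.0-Autoregressive-13B-Video2World",
)
print(f"Checkpoint downloaded to: {checkpoint_dir}")
```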
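For the Run Inference step, the overall flow looks like the outline below. The `pipeline` object and its methods are hypothetical stand-ins for whatever loader and generator your inference stack provides (for example, NVIDIA's Cosmos reference code); only the sequence of operations is illustrated.

```python
import torch

@torch.no_grad()
def run_video2world(pipeline, prompt: str, conditioning_path: str, output_path: str = "output.mp4"):
    # Hypothetical: read either a single image or a 9-frame video clip.
    frames = pipeline.load_conditioning(conditioning_path)
    # Hypothetical: autoregressively generate the remaining frames,
    # conditioned on the prompt via cross-attention. Sampling parameters are illustrative.
    video = pipeline.generate(prompt=prompt, frames=frames, temperature=1.0, top_p=0.9)
    # Hypothetical: write the resulting 33-frame clip to disk.
    pipeline.save_video(video, output_path)
    return video
```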
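For the hardware recommendation, a quick check that PyTorch can see a suitable GPU uses only standard torch.cuda calls:

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024**3
    # A 13B-parameter model needs roughly 26 GB for bf16 weights alone,
    # before activations and the KV cache are accounted for.
    print(f"GPU: {props.name} ({total_gb:.0f} GB)")
else:
    print("No CUDA GPU detected; running a 13B model on CPU is impractical.")
```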
License
The Cosmos-1.0-Autoregressive-13B-Video2World model is released under the NVIDIA Open Model License. This license permits commercial use and the creation and distribution of derivative models. NVIDIA does not claim ownership of outputs generated using the models. The license requires adherence to NVIDIA's Trustworthy AI terms and prohibits bypassing technical limitations or safety mechanisms embedded in the model. For more details, refer to the NVIDIA Open Model License.