Introduction
RDT-1B is a 1-billion-parameter imitation-learning Diffusion Transformer pre-trained on over 1 million multi-robot episodes. It predicts robot actions from language instructions and RGB images, and supports a variety of robot platforms, including single-arm, dual-arm, and wheeled robots.
Architecture
- Developed by: TSAIL group, Tsinghua University
- Task Type: Vision-Language-Action
- Model Type: Diffusion Policy with Transformers
- Multi-Modal Encoders (see the loading sketch after this list):
  - Vision Backbone: siglip-so400m-patch14-384
  - Language Model: t5-v1_1-xxl
- Pre-Training Datasets: Utilizes 46 datasets including RT-1 Dataset, RH20T, DROID, and others.
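Both encoders are standard Hugging Face checkpoints. RDT's own code loads and wraps them internally, but for reference, a minimal sketch of loading them stand-alone with transformers (illustrative only, not part of the RDT pipeline):

```python
from transformers import (
    AutoTokenizer,
    SiglipImageProcessor,
    SiglipVisionModel,
    T5EncoderModel,
)

# Vision backbone used for image conditioning.
vision_encoder = SiglipVisionModel.from_pretrained("google/siglip-so400m-patch14-384")
image_processor = SiglipImageProcessor.from_pretrained("google/siglip-so400m-patch14-384")

# Language encoder used to embed the instruction text.
text_encoder = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")
tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-xxl")
```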
Training
RDT-1B is trained to take language instructions, RGB images, control frequency, and proprioception as input to predict the next 64 robot actions. The model uses a unified action space to accommodate different robot platforms, although it may require fine-tuning for new, unseen platforms.
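Conceptually, a diffusion policy produces this 64-step action chunk by starting from random noise and iteratively denoising it, conditioned on the encoded observations. The sketch below illustrates that general idea using the diffusers DDPM scheduler; the `denoiser` callable, the 128-dimensional unified action vector, and the schedule settings are assumptions for illustration and do not reflect RDT's actual sampler.

```python
import torch
from diffusers import DDPMScheduler

# Stand-in noise schedule; RDT uses its own sampler and settings.
scheduler = DDPMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(100)

@torch.no_grad()
def sample_action_chunk(denoiser, obs_cond, chunk_size=64, action_dim=128):
    # Start from Gaussian noise over the entire action chunk and denoise it
    # step by step, conditioned on the encoded observations (images, language,
    # proprioception, control frequency).
    actions = torch.randn(1, chunk_size, action_dim)
    for t in scheduler.timesteps:
        noise_pred = denoiser(actions, t, obs_cond)  # predicted added noise
        actions = scheduler.step(noise_pred, t, actions).prev_sample
    return actions
```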
Guide: Running Locally
- Clone the Repository:

```bash
git clone https://github.com/thu-ml/RoboticsDiffusionTransformer
cd RoboticsDiffusionTransformer
```
- Install Dependencies: Follow the instructions in the repository to set up the environment.
- Create and Configure the Model:

```python
import torch

from scripts.agilex_model import create_model

# Robot-specific configuration: episode length, proprioceptive state
# dimension, action chunk size, and camera views.
config = {
    'episode_len': 1000,
    'state_dim': 14,
    'chunk_size': 64,
    'camera_names': ['cam_high', 'cam_right_wrist', 'cam_left_wrist'],
}

model = create_model(
    args=config,
    dtype=torch.bfloat16,
    pretrained_vision_encoder_name_or_path="google/siglip-so400m-patch14-384",
    pretrained='robotics-diffusion-transformer/rdt-1b',
    control_frequency=25,
)
```
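The values above appear to correspond to the dual-arm Agilex/ALOHA example in the repository: a 14-dimensional proprioceptive state, three camera views, and a 64-step action chunk matching the prediction horizon described earlier. Adapt `state_dim`, `camera_names`, and `control_frequency` to your own platform.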
- Perform Inference: Load pre-computed language embeddings and use the model to predict the next action chunk, as sketched below.
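A minimal sketch of this step, following the inference example in the repository; the file names and tensor shapes are hypothetical, and the method and argument names (`model.step` with `proprio`, `images`, `text_embeds`) may differ in the current code.

```python
import torch
from PIL import Image

# Hypothetical file names and shapes, for illustration only.
text_embedding = torch.load('path/to/instruction_embedding.pt')  # pre-computed T5 embedding
images = [Image.open(f'frame_{i}.png') for i in range(6)]        # e.g. 3 cameras x 2 recent frames
proprio = torch.zeros(1, 14)                                      # current robot state (dual-arm example)

# Predict the next `chunk_size` (64) actions; `model` is the object created above.
# Method and argument names follow the repository's example and are assumptions here.
actions = model.step(
    proprio=proprio,
    images=images,
    text_embeds=text_embedding,
)
```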
Cloud GPUs: For optimal performance, consider using cloud services like AWS EC2, Google Cloud, or Azure for access to high-performance GPUs.
License
The RDT-1B model, code, pre-trained weights, and data are available under the MIT license.