Introduction

RDT-1B is a 1-billion-parameter imitation-learning Diffusion Transformer pre-trained on over 1 million multi-robot episodes. It predicts robot actions from language instructions and RGB images, and supports a range of robot embodiments, including single-arm, dual-arm, and mobile (wheeled) robots.

Architecture

  • Developed by: TSAIL group, Tsinghua University
  • Task Type: Vision-Language-Action
  • Model Type: Diffusion Policy with Transformers
  • Multi-Modal Encoders:
    • Vision Backbone: siglip-so400m-patch14-384
    • Language Model: t5-v1_1-xxl
  • Pre-Training Data: 46 datasets, including the RT-1 Dataset, RH20T, DROID, and others.

Training

RDT-1B is trained to take language instructions, RGB images, the control frequency, and proprioception as input and to predict the next 64 robot actions as a single chunk. A unified action space lets one model accommodate different robot platforms, although fine-tuning may be required for new, unseen platforms.
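The chunked prediction in a unified action space can be sketched with plain shapes. This is a minimal illustration, not the repository's implementation: the unified dimension of 128 and the identity index mapping are assumptions for clarity.

```python
import numpy as np

# Hypothetical dimensions for illustration only; the real unified action
# space and index mapping are defined in the RDT repository's configs.
UNIFIED_ACTION_DIM = 128   # assumption: padded cross-robot action vector
CHUNK_SIZE = 64            # RDT-1B predicts the next 64 actions per step
ROBOT_STATE_DIM = 14       # e.g. a dual-arm platform, 7 DoF per arm

def to_unified(action: np.ndarray) -> np.ndarray:
    """Embed a robot-specific action into the unified space by placing
    its values at fixed indices and zero-padding the remainder."""
    unified = np.zeros(UNIFIED_ACTION_DIM)
    unified[:action.shape[0]] = action  # assumption: identity index map
    return unified

# One predicted chunk is a (CHUNK_SIZE, UNIFIED_ACTION_DIM) array; each
# row is sliced back down to the robot's own action dimensions.
chunk = np.stack([to_unified(np.random.randn(ROBOT_STATE_DIM))
                  for _ in range(CHUNK_SIZE)])
robot_actions = chunk[:, :ROBOT_STATE_DIM]
print(chunk.shape, robot_actions.shape)  # (64, 128) (64, 14)
```

The padding scheme is what lets one set of weights serve robots with different degrees of freedom: each platform reads and writes only its own slice of the unified vector.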

Guide: Running Locally

  1. Clone the Repository:

    git clone https://github.com/thu-ml/RoboticsDiffusionTransformer
    cd RoboticsDiffusionTransformer
    
  2. Install Dependencies: Follow the instructions in the repository to set up the environment.

  3. Create and Configure the Model:

    import torch
    from scripts.agilex_model import create_model

    # Robot-specific configuration (values from the Agilex example)
    config = {
        'episode_len': 1000,   # maximum episode length
        'state_dim': 14,       # proprioception dimension
        'chunk_size': 64,      # actions predicted per inference step
        'camera_names': ['cam_high', 'cam_right_wrist', 'cam_left_wrist'],
    }
    model = create_model(
        args=config,
        dtype=torch.bfloat16,
        pretrained_vision_encoder_name_or_path="google/siglip-so400m-patch14-384",
        pretrained='robotics-diffusion-transformer/rdt-1b',
        control_frequency=25,
    )
    
  4. Perform Inference: Load pre-computed language embeddings (encoded with t5-v1_1-xxl) and call the model to predict the next chunk of actions.
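The inference step above can be sketched as follows. This is a shape-level mock, not the repository's API: `predict_actions` is a placeholder standing in for the model call, and the observation arrays are zeros rather than real camera frames or joint states.

```python
import numpy as np

# Hypothetical shapes for illustration; the real inference entry point
# lives in the repository (see scripts/agilex_model.py).
lang_embedding = np.zeros((1, 4096))   # t5-v1_1-xxl hidden size is 4096
images = [np.zeros((384, 384, 3), dtype=np.uint8)  # one frame per camera
          for _ in ("cam_high", "cam_right_wrist", "cam_left_wrist")]
proprio = np.zeros(14)                 # state_dim from the config above

def predict_actions(lang, imgs, state, chunk_size=64):
    """Placeholder for the model's inference call; RDT-1B returns the
    next `chunk_size` actions as one chunk."""
    return np.zeros((chunk_size, state.shape[0]))

actions = predict_actions(lang_embedding, images, proprio)
print(actions.shape)  # (64, 14)
```

In a real deployment the predicted chunk would be executed on the robot (or partially executed and re-planned) before the next inference call.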

Cloud GPUs: For optimal performance, consider using cloud services like AWS EC2, Google Cloud, or Azure for access to high-performance GPUs.

License

The RDT-1B model, code, pre-trained weights, and data are available under the MIT license.