AIDO.Protein2StructureToken-16B

genbio-ai

Introduction

AIDO.Protein2StructureToken-16B is a fine-tuned model for predicting protein structures from amino acid sequences. It is based on AIDO.Protein-16B and works by converting an input sequence into discrete structure tokens, which a separate decoder then turns into a 3D structure. On structure prediction, it outperforms existing models such as ESM3-open.

Architecture

The model architecture is derived from the encoder-only transformer design of AIDO.Protein-16B, with sparse Mixture of Experts (MoE) layers in place of dense MLP layers. Each token activates two experts via a top-2 routing mechanism (a minimal sketch of this routing follows the parameter list). Key architectural parameters include:

  • Number of Attention Heads: 36
  • Number of Hidden Layers: 36
  • Hidden Size: 2304
  • Number of Experts per MoE Layer: 8
  • Number of Active Experts per Token: 2
  • Input Vocabulary Size: 44 (amino acids + special tokens)
  • Output Vocabulary Size: 512 (structure tokens)
  • Context Length: 1024
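
The top-2 routing described above can be made concrete with a short sketch. This is a minimal illustration assuming 8 experts per MoE layer with 2 active per token, as listed; the expert FFN width, module structure, and small example sizes are illustrative assumptions, not the model's actual implementation.

```python
# Minimal top-2 MoE routing sketch (illustrative, not the model's actual code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class Top2MoELayer(nn.Module):
    """Sparse MoE feed-forward layer: each token is processed by its top-2 experts."""

    def __init__(self, hidden_size=2304, num_experts=8, top_k=2, ffn_dim=4 * 2304):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, ffn_dim),
                nn.GELU(),
                nn.Linear(ffn_dim, hidden_size),
            )
            for _ in range(num_experts)
        )

    def forward(self, x):
        # x: (batch, seq_len, hidden_size)
        b, s, h = x.shape
        tokens = x.reshape(-1, h)                      # (num_tokens, hidden)
        gate_logits = self.router(tokens)              # (num_tokens, num_experts)
        weights, expert_ids = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)           # renormalize the two gates
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # rows (tokens) and slots (1st or 2nd choice) that routed to expert e
            rows, slots = (expert_ids == e).nonzero(as_tuple=True)
            if rows.numel() == 0:
                continue
            out[rows] += weights[rows, slots].unsqueeze(-1) * expert(tokens[rows])
        return out.reshape(b, s, h)


# Example with small illustrative sizes (the real hidden size is 2304).
layer = Top2MoELayer(hidden_size=64, ffn_dim=256)
tokens = torch.randn(2, 16, 64)
print(layer(tokens).shape)  # torch.Size([2, 16, 64])
```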

Training

The model was fine-tuned on roughly 0.4 trillion tokens drawn from the AlphaFold Database and the Protein Data Bank (PDB). Training took about 20 days on 64 NVIDIA A100 GPUs (a consistency check for the token count follows the list below). Key training details include:

  • Global Batch Size: 2048
  • Context Length: 1024
  • Precision: FP16
  • Maximum Learning Rate: 1e-4
  • Scheduler: Cosine decay with 2.5% warmup
  • Total Tokens Trained: ~0.4 trillion
  • Training Steps: 200k
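
As a quick consistency check, the ~0.4 trillion token figure follows directly from the batch size, context length, and step count listed above:

```python
# Total training tokens implied by the hyperparameters above.
global_batch_size = 2048   # sequences per optimizer step
context_length = 1024      # tokens per sequence
training_steps = 200_000

total_tokens = global_batch_size * context_length * training_steps
print(f"{total_tokens:,}")              # 419,430,400,000
print(f"~{total_tokens / 1e12:.2f}T")   # ~0.42 trillion, i.e. roughly 0.4T tokens
```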

Guide: Running Locally

Structure Prediction Steps:

  1. Install ModelGenerator: follow the setup instructions in the genbio-ai ModelGenerator repository on GitHub.
  2. Run Prediction: Use mgen predict with the provided YAML configuration.
  3. Convert Output: Use Python scripts to convert the .tsv predictions to .pt and extract the codebook (a conversion sketch follows this list).
  4. Decode Structures: Run the decoding command to obtain 3D structures in PDB format.
  5. Comparison: Validate predictions against ground-truth structures (see the comparison sketch at the end of this section).
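
For step 3, the sketch below shows one way the .tsv-to-.pt conversion could look. The file names, column names, and token encoding are assumptions about the mgen prediction output rather than a documented schema, and the codebook extraction is omitted because it depends on the tokenizer artifacts; adapt everything to the files you actually get.

```python
# Hypothetical step-3 sketch: turn a prediction .tsv into a .pt file of token tensors.
# Column names ("id", "predictions") and the whitespace-separated token format are
# assumptions, not a documented schema.
import pandas as pd
import torch

predictions = pd.read_csv("predictions.tsv", sep="\t")  # placeholder path

structure_tokens = {}
for _, row in predictions.iterrows():
    # Assume one row per sequence: an identifier plus predicted structure-token
    # indices (values in [0, 511], matching the 512-token output vocabulary).
    token_ids = [int(t) for t in str(row["predictions"]).split()]
    structure_tokens[row["id"]] = torch.tensor(token_ids, dtype=torch.long)

torch.save(structure_tokens, "structure_tokens.pt")
```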

For optimal performance, consider using cloud-based GPUs like NVIDIA A100.
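
For step 5, a minimal comparison sketch is shown below. It computes a Cα RMSD after superposition with Biopython; the file paths are placeholders, and it assumes the predicted and ground-truth files contain the same residues. For benchmark-style evaluation, TM-score (e.g. via TM-align) is the more common metric.

```python
# Hypothetical step-5 sketch: Cα RMSD between a decoded structure and its ground truth.
from Bio.PDB import PDBParser, Superimposer

parser = PDBParser(QUIET=True)
pred = parser.get_structure("pred", "predicted.pdb")    # placeholder path
ref = parser.get_structure("ref", "ground_truth.pdb")   # placeholder path

# Collect Cα atoms in residue order from the first model of each structure.
pred_ca = [a for a in pred[0].get_atoms() if a.get_name() == "CA"]
ref_ca = [a for a in ref[0].get_atoms() if a.get_name() == "CA"]
assert len(pred_ca) == len(ref_ca), "structures must cover the same residues"

sup = Superimposer()
sup.set_atoms(ref_ca, pred_ca)   # fixed atoms first, then moving atoms
print(f"C-alpha RMSD: {sup.rms:.2f} Å")
```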

License

The model is distributed under a custom license. For more details, please refer to the Hugging Face model card or contact the authors directly.
