AIDO.Protein2StructureToken-16B
Introduction
AIDO.Protein2StructureToken-16B is a fine-tuned model for predicting protein structures from amino acid sequences. It is based on AIDO.Protein-16B and works by converting input sequences into structure tokens, which can then be decoded into 3D structures. This model demonstrates superior performance over existing models like ESM3-open.
Architecture
The model architecture is derived from the transformer encoder-only design of AIDO.Protein-16B, replacing dense MLP layers with sparse Mixture of Experts (MoE) layers. Each token activates two of the available experts via a top-2 routing mechanism (sketched after this list). Key architectural parameters include:
- Number of Attention Heads: 36
- Number of Hidden Layers: 36
- Hidden Size: 2304
- Number of Experts per MoE Layer: 8
- Number of Experts Activated per Token: 2
- Input Vocabulary Size: 44 (amino acids + special tokens)
- Output Vocabulary Size: 512 (structure tokens)
- Context Length: 1024
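To make the routing concrete, here is a minimal, self-contained sketch of a top-2 MoE feed-forward layer in PyTorch. It is illustrative only: the class name, expert FFN shape, and gating details are assumptions rather than the actual AIDO.Protein-16B implementation; only the hidden size (2304) and the expert counts (8 per layer, 2 active per token) come from the parameters above.

```python
# Illustrative top-2 MoE layer. Shapes and gating details are assumptions
# for exposition; they are NOT the AIDO.Protein-16B source code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Top2MoE(nn.Module):
    def __init__(self, hidden_size=2304, num_experts=8, ffn_size=None):
        super().__init__()
        ffn_size = ffn_size or 4 * hidden_size
        self.router = nn.Linear(hidden_size, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, ffn_size),
                nn.GELU(),
                nn.Linear(ffn_size, hidden_size),
            )
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (num_tokens, hidden_size)
        logits = self.router(x)                    # (num_tokens, num_experts)
        gates, idx = logits.topk(2, dim=-1)        # pick top-2 experts per token
        gates = F.softmax(gates, dim=-1)           # normalize the two gate weights
        out = torch.zeros_like(x)
        for k in range(2):                         # for each of the two routes
            for e in range(len(self.experts)):
                mask = idx[:, k] == e              # tokens routed to expert e
                if mask.any():
                    out[mask] += gates[mask, k, None] * self.experts[e](x[mask])
        return out
```

The appeal of this design is that per-token compute stays close to two dense FFN passes while the parameter count scales with all eight experts, which is how a 16B-parameter model remains tractable to fine-tune.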
Training
The model was fine-tuned on 0.4 trillion tokens drawn from the AlphaFold Database and the Protein Data Bank (PDB). Training ran for 20 days on 64 NVIDIA A100 GPUs. Key training details include:
- Global Batch Size: 2048
- Context Length: 1024
- Precision: FP16
- Maximum Learning Rate: 1e-4
- Scheduler: Cosine decay with 2.5% warmup
- Total Tokens: 0.4 trillion (see the sanity check below)
- Training Steps: 200k
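These figures are mutually consistent, as a quick back-of-the-envelope check shows (all values taken from the list above):

```python
# Sanity check: tokens seen = global batch size x context length x steps.
global_batch_size = 2048
context_length = 1024
training_steps = 200_000

total_tokens = global_batch_size * context_length * training_steps
print(f"{total_tokens / 1e12:.2f} trillion tokens")  # ~0.42 trillion

# The 2.5% warmup of the cosine schedule therefore spans:
warmup_steps = int(0.025 * training_steps)
print(warmup_steps)  # 5000 steps
```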
Guide: Running Locally
Structure Prediction Steps:
- Install ModelGenerator: see GitHub - ModelGenerator.
- Run Prediction: use `mgen predict` with the provided YAML configuration.
- Convert Output: use Python scripts to convert the `.tsv` predictions to `.pt` and extract the codebook (a minimal sketch follows this list).
- Decode Structures: run the decoding command to obtain 3D structures in PDB format.
- Comparison: validate predictions against ground-truth structures.
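The exact output schema depends on the ModelGenerator version; the following is a minimal sketch of the `.tsv`-to-`.pt` conversion step, assuming one row per sequence with a column of predicted structure-token IDs. The file names and column names here are assumptions for illustration, not the project's actual format.

```python
# Minimal sketch of the .tsv -> .pt conversion. File paths, column names,
# and the token-ID encoding are assumptions; consult the ModelGenerator
# scripts for the actual format.
import ast

import pandas as pd
import torch

df = pd.read_csv("predictions.tsv", sep="\t")  # hypothetical mgen output file

structure_tokens = {
    row["sequence_id"]: torch.tensor(ast.literal_eval(row["predicted_tokens"]))
    for _, row in df.iterrows()
}

# Save in .pt form for the downstream structure decoder.
torch.save(structure_tokens, "structure_tokens.pt")
```

The decoding step then maps these token IDs through the 512-entry structure-token codebook back to 3D coordinates, yielding PDB files that can be compared against ground-truth structures.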
For optimal performance, consider using a cloud-based GPU such as the NVIDIA A100.
License
The model is distributed under a custom license. For more details, please refer to the Hugging Face model card or contact the authors directly.