ZymCTRL
AI4PD
Introduction
ZymCTRL is a conditional language model designed for generating artificial functional enzymes based on user-defined Enzyme Commission (EC) numbers. It was trained on the UniProt database, encompassing over 37 million sequences with EC annotations. The model generates protein sequences that are novel yet exhibit desired catalytic properties.
Architecture
ZymCTRL is based on the CTRL Transformer architecture, a decoder-only design similar to the GPT models behind ChatGPT. It has 36 layers and a model dimensionality of 1280, for a total of 738 million parameters. The model was pre-trained on enzyme sequences from the UniProt database, with the EC class prepended to each sequence. Because it is trained to predict the next token, it learns the dependencies between EC classes and sequences, which is what allows generation to be conditioned on an EC number.
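As a rough sanity check on the stated size, the transformer blocks can be counted by hand. The sketch below assumes GPT-2-style blocks with a 4x feed-forward expansion, which the text does not state, and it ignores embeddings, biases, and layer norms.

```python
# Rough parameter count for the transformer blocks only.
# Assumption (not stated in the text): GPT-2-style blocks with a
# feed-forward hidden size of 4 * d_model.
n_layers = 36
d_model = 1280

# Query, key, value and output projections: four d_model x d_model matrices.
attention_params = 4 * d_model * d_model
# Two feed-forward matrices of shape d_model x (4 * d_model).
feed_forward_params = 8 * d_model * d_model
block_params = attention_params + feed_forward_params

total_block_params = n_layers * block_params
print(f"{total_block_params:,}")  # 707,788,800
```

Under these assumptions the blocks alone account for roughly 708 million weights, in the same range as the stated 738 million. The remainder would come from the terms left out here, whose sizes the text does not give.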
Training
The model was trained with an autoregressive objective on 48 NVIDIA A100 GPUs for eight epochs, using a block size of 1024 tokens and a total batch size of 768. The Adam optimizer was used with beta1 = 0.9, beta2 = 0.999, and a learning rate of 0.8e-04 (that is, 8e-05).
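The training figures imply a few derived quantities that are easy to work out. The sketch below assumes the total batch is split evenly across the GPUs and that every sequence fills the whole block. Neither is stated in the text, so the token count is an upper bound.

```python
# Derived quantities from the stated training setup.
n_gpus = 48
total_batch_size = 768   # sequences per optimizer step
block_size = 1024        # tokens per sequence (maximum)

# Assumption: the batch is divided evenly across GPUs.
per_gpu_batch = total_batch_size // n_gpus

# Upper bound: assumes every sequence is exactly block_size tokens long.
max_tokens_per_step = total_batch_size * block_size

# Optimizer settings as reported, collected in one place.
adam_settings = {"lr": 0.8e-04, "betas": (0.9, 0.999)}

print(per_gpu_batch)        # 16
print(max_tokens_per_step)  # 786432
```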
Guide: Running Locally
To run ZymCTRL locally:
- Install Dependencies: Make sure you have the Hugging Face Transformers library installed.
- Download ZymCTRL: Download the model files to a local directory.
- Prepare Environment: Ensure you have Python and PyTorch installed. A GPU is recommended for faster generation.
- Run Generation Script: Use the provided Python script to generate sequences, modifying paths as needed. Run the script with:
python generate.py
If no local GPU is available, a cloud GPU instance (for example on AWS or Google Cloud) is a practical alternative.
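The contents of the provided generation script are not shown here, so the following is only a minimal sketch of what EC-conditioned generation with the Hugging Face Transformers library could look like. The helper names `build_prompt` and `generate_sequences`, the use of the bare EC number as the prompt, and the sampling values are all illustrative assumptions, not the authors' script. The EC check is simplified to four dot-separated integers.

```python
import re

# Simplified EC format: four dot-separated integers, e.g. "3.5.5.1".
EC_PATTERN = re.compile(r"^\d+\.\d+\.\d+\.\d+$")


def build_prompt(ec_number: str) -> str:
    """Validate an EC number and return it as the prompt (assumed format)."""
    ec_number = ec_number.strip()
    if not EC_PATTERN.match(ec_number):
        raise ValueError(f"Not a four-level EC number: {ec_number!r}")
    return ec_number


def generate_sequences(ec_number: str, model_path: str, n_sequences: int = 5):
    """Sample sequences conditioned on an EC number (hypothetical helper)."""
    # Imported lazily so the prompt helper works without PyTorch installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path).to(device)
    model.eval()

    prompt = build_prompt(ec_number)
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model.generate(
            input_ids,
            do_sample=True,
            top_k=9,                  # illustrative sampling values,
            repetition_penalty=1.2,   # not taken from the text
            max_length=1024,          # matches the training block size
            num_return_sequences=n_sequences,
        )
    # Decoded text may still need post-processing to strip the EC label.
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]


# Example call (requires the downloaded model; the path is a placeholder):
# sequences = generate_sequences("3.5.5.1", "/path/to/zymctrl")
```

The heavy imports sit inside `generate_sequences` so that input validation can be checked on a machine without a GPU or PyTorch.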
License
ZymCTRL is licensed under the Apache-2.0 license.