esm3 sm open v1
EvolutionaryScale
Introduction
esm3-sm-open-v1 is a generative model trained on a large dataset of protein sequences, structures, and functions. It generates proteins from partial prompts that specify sequence, structure, or function. As a safety measure, data related to viruses and potentially harmful organisms was removed from the training set.
Architecture
The model was trained on 2.78 billion natural proteins. Synthetic data augmentation brings the dataset to 3.15 billion protein sequences, 236 million protein structures, and 539 million proteins with function annotations, for a total of 771 billion tokens. The model architecture supports protein design by generating new sequences conditioned on existing ones.
Training
The model was trained on a large-scale dataset of natural and synthetically augmented protein data. For safety, data associated with viruses and harmful organisms was excluded, and the function decoder filters out potentially dangerous keywords.
Guide: Running Locally
To run the ESM3 model locally:
- Install Dependencies:
  - Use the Python package manager to install the required modules:
    pip install esm
- Access Resources:
  - Visit the ESM GitHub repository for detailed instructions and example notebooks.
- Hardware Recommendations:
  - For optimal performance, it is recommended to use cloud-based GPUs. Providers like AWS, Google Cloud, and Azure offer suitable configurations for intensive computational tasks.
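The steps above can be sketched in Python. This is a minimal illustration, not the official workflow. It assumes the `esm` package exposes `ESM3.from_pretrained`, `ESMProtein`, and `GenerationConfig` as in the ESM GitHub repository's examples, that the weights are registered under the name `esm3_sm_open_v1`, and that you have already accepted the license and authenticated with the weight host. The helper `build_masked_prompt` and the wrapper `generate_sequence` are hypothetical names introduced here. The underscore mask character follows the repository's prompt examples. Check the repository for the current API before relying on any of this.

```python
def build_masked_prompt(sequence: str, start: int, end: int, mask: str = "_") -> str:
    """Replace residues in [start, end) with a mask character (hypothetical helper).

    The "_" mask is an assumption based on the prompt style used in the
    ESM repository examples.
    """
    if not 0 <= start <= end <= len(sequence):
        raise ValueError("mask span must lie within the sequence")
    return sequence[:start] + mask * (end - start) + sequence[end:]


def generate_sequence(prompt: str, device: str = "cpu", num_steps: int = 8) -> str:
    """Fill the masked positions of `prompt` with ESM3 (hypothetical wrapper)."""
    # Imported lazily so the helper above works without the esm package installed.
    from esm.models.esm3 import ESM3
    from esm.sdk.api import ESMProtein, GenerationConfig

    # Assumption: the open weights are registered as "esm3_sm_open_v1".
    model = ESM3.from_pretrained("esm3_sm_open_v1").to(device)
    protein = ESMProtein(sequence=prompt)
    config = GenerationConfig(track="sequence", num_steps=num_steps, temperature=0.7)
    protein = model.generate(protein, config)
    return protein.sequence


if __name__ == "__main__":
    # Mask residues 2..4 of a toy sequence and ask the model to fill them in.
    prompt = build_masked_prompt("ACDEFGHIKL", 2, 5)
    print(prompt)
    # Pass device="cuda" on a GPU machine, per the hardware recommendation.
    print(generate_sequence(prompt, device="cpu"))
```

Keeping the model import inside `generate_sequence` lets the prompt helper be used and tested on a machine without the model weights. The first call downloads the weights, so expect it to take a while.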
License
The ESM3 model is distributed under the EvolutionaryScale Community License Agreement, a custom license that permits non-commercial use only.