Introduction

esm3-sm-open-v1 is a generative model trained on a large dataset of protein sequences, structures, and functions. It generates proteins from partial prompts that specify any combination of sequence, structure, and function. For safety, data related to viruses and potentially harmful organisms was removed from the training set.

Architecture

The model was trained on 2.78 billion natural proteins; synthetic data augmentation increased the dataset to 3.15 billion protein sequences. The training data includes 236 million protein structures and 539 million proteins with function annotations, for a total of 771 billion tokens. The model supports protein design by generating new sequences conditioned on existing ones.
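The figures above imply a few derived quantities that help put the dataset in proportion. The following is a minimal sketch of that arithmetic. It assumes the structure and function-annotation counts are subsets of the 3.15 billion sequences and that the token total can be averaged over all sequences; neither assumption is stated in the text, and the variable names are illustrative only.

```python
# Figures quoted in the text, expressed in millions so the arithmetic stays exact.
NATURAL_PROTEINS_M = 2_780      # 2.78 billion natural proteins
TOTAL_SEQUENCES_M = 3_150       # 3.15 billion sequences after augmentation
STRUCTURES_M = 236              # 236 million structures
FUNCTION_ANNOTATED_M = 539      # 539 million proteins with function annotations
TOTAL_TOKENS_M = 771_000        # 771 billion tokens

# Sequences contributed by synthetic augmentation.
synthetic_m = TOTAL_SEQUENCES_M - NATURAL_PROTEINS_M

# Shares of the augmented dataset (assumes both counts are subsets of it).
structure_share = STRUCTURES_M / TOTAL_SEQUENCES_M
function_share = FUNCTION_ANNOTATED_M / TOTAL_SEQUENCES_M

# Rough average token count per sequence (assumes tokens are spread over all sequences).
tokens_per_sequence = TOTAL_TOKENS_M / TOTAL_SEQUENCES_M

print(f"synthetic sequences: {synthetic_m} million")
print(f"with structure: {structure_share:.1%}")
print(f"with function annotation: {function_share:.1%}")
print(f"average tokens per sequence: {tokens_per_sequence:.0f}")
```

Under these assumptions, augmentation adds about 370 million sequences, and only a minority of sequences carry structure or function data.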

Training

The model was trained on a large-scale dataset that combines natural and synthetically augmented protein data. For safety, data associated with viruses and harmful organisms was excluded, and the function decoder filters out potentially dangerous keywords.
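The text does not describe how the keyword filtering is implemented. As a toy illustration of the general idea only, the sketch below drops predicted function keywords that appear on a blocklist. The function name, the blocklist, and its placeholder terms are all hypothetical and are not taken from the ESM3 code or model card.

```python
# Hypothetical placeholder terms; the real filtered vocabulary is not given in the text.
BLOCKED_KEYWORDS = frozenset({"placeholder-term-a", "placeholder-term-b"})


def filter_function_keywords(predicted, blocked=BLOCKED_KEYWORDS):
    """Return the predicted keywords with blocklisted ones removed.

    Matching ignores case and surrounding whitespace; the order of the
    remaining keywords is preserved.
    """
    return [kw for kw in predicted if kw.strip().lower() not in blocked]


# Example: the blocklisted term is dropped, the others pass through unchanged.
kept = filter_function_keywords(["kinase", "Placeholder-Term-A", "hydrolase"])
print(kept)
```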

Guide: Running Locally

To run the ESM3 model locally:

  1. Install Dependencies:

    • Install the required package with pip:
      pip install esm
      
  2. Access Resources:

  3. Hardware Recommendations:

    • For best performance, run the model on a GPU. If no local GPU is available, cloud providers such as AWS, Google Cloud, and Azure offer suitable GPU instances.
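Once the package is installed, a run might look like the sketch below. It builds a partial sequence prompt and asks the model to fill in the masked positions. The import paths, the "esm3_sm_open_v1" identifier, the "_" mask character, and the GenerationConfig arguments follow the esm project's published examples as recalled here and may differ between package versions, so treat them as assumptions. Downloading the weights is also assumed to require a Hugging Face account that has accepted the license. build_masked_prompt and complete_sequence are hypothetical helper names introduced for illustration.

```python
def build_masked_prompt(sequence, keep, mask="_"):
    """Keep residues inside the half-open [start, end) ranges and mask the rest."""
    kept = [False] * len(sequence)
    for start, end in keep:
        for i in range(max(start, 0), min(end, len(sequence))):
            kept[i] = True
    return "".join(res if k else mask for res, k in zip(sequence, kept))


def complete_sequence(prompt, num_steps=8, temperature=0.7):
    """Fill the masked positions of a sequence prompt with the local model."""
    # Imports are deferred so build_masked_prompt works without esm or torch installed.
    import torch
    from esm.models.esm3 import ESM3
    from esm.sdk.api import ESMProtein, GenerationConfig

    # Prefer a GPU when one is visible; fall back to CPU otherwise (much slower).
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = ESM3.from_pretrained("esm3_sm_open_v1").to(device)

    protein = model.generate(
        ESMProtein(sequence=prompt),
        GenerationConfig(track="sequence", num_steps=num_steps, temperature=temperature),
    )
    return protein.sequence


if __name__ == "__main__":
    # Keep residues 2 and 3 of a short toy sequence and mask everything else.
    prompt = build_masked_prompt("ACDEFG", [(2, 4)])
    print(prompt)
    # Uncomment to run generation; this downloads the model weights on first use.
    # print(complete_sequence(prompt))
```

The generation call is left commented out because it downloads the weights and benefits from a GPU; the prompt helper runs on its own.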

License

The ESM3 model is distributed under a custom license for non-commercial use only. Users must adhere to the EvolutionaryScale Community License Agreement.
