ProtGPT2 Documentation

Author: nferruz

Introduction

ProtGPT2 is a language model designed for de novo protein design and engineering. It effectively generates protein sequences by conserving critical features of natural proteins, such as amino acid propensities, structural content, and globularity, while exploring new regions of protein space.

Architecture

ProtGPT2 is based on the GPT-2 Transformer architecture: a decoder-only model with 36 layers and a model dimensionality of 1280, totaling 738 million parameters. It was pre-trained on the UniRef50 database, with FASTA headers removed.
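These dimensions can be checked directly against the published checkpoint's configuration. A minimal sketch, assuming the model is hosted on the Hugging Face Hub under the author's namespace as nferruz/ProtGPT2:

```python
from transformers import AutoConfig

# Fetch the model configuration (Hub ID assumed: nferruz/ProtGPT2).
config = AutoConfig.from_pretrained("nferruz/ProtGPT2")

print(config.n_layer)  # 36 decoder layers
print(config.n_embd)   # hidden dimensionality of 1280
```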

Training

ProtGPT2 was trained in a self-supervised fashion on raw protein sequences: the objective is to predict the next token in a sequence, which lets the model learn the statistical "language" of proteins. Training ran for 50 epochs on 128 NVIDIA A100 GPUs, with a block size of 512, a total batch size of 1024, and the Adam optimizer at a learning rate of 1e-3.
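The next-token (causal language modeling) objective can be illustrated in a few lines of Transformers code. The following is a minimal sketch, assuming the nferruz/ProtGPT2 Hub ID; the protein fragment is an arbitrary example, not taken from the training set:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")
model = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2")

# An arbitrary protein fragment; actual training used raw UniRef50 sequences.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    # Supplying the input ids as labels makes the model return the
    # next-token cross-entropy loss used during self-supervised training.
    outputs = model(**inputs, labels=inputs["input_ids"])

print(f"Next-token prediction loss: {outputs.loss.item():.3f}")
```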

Guide: Running Locally

To use ProtGPT2, follow these steps:

  1. Install the Hugging Face Transformers library (e.g., pip install transformers), following the official installation guide.
  2. Initialize the model with the Transformers text-generation pipeline.
  3. Run the model to generate sequences in a zero-shot fashion, or fine-tune it on user-defined sequences (see the sketch after this list).
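A minimal zero-shot generation sketch, assuming the nferruz/ProtGPT2 Hub ID; the sampling parameters below are illustrative, not prescriptive:

```python
from transformers import pipeline

# Text-generation pipeline around ProtGPT2 (Hub ID assumed: nferruz/ProtGPT2).
protgpt2 = pipeline("text-generation", model="nferruz/ProtGPT2")

# "<|endoftext|>" prompts the model to sample protein sequences from scratch.
sequences = protgpt2(
    "<|endoftext|>",
    max_length=100,
    do_sample=True,
    top_k=950,
    repetition_penalty=1.2,
    num_return_sequences=5,
    eos_token_id=0,
)

for s in sequences:
    print(s["generated_text"])
```

For fine-tuning on user-defined sequences, the standard causal language modeling workflow in Transformers (for example, the run_clm.py example script) can be adapted to a text file of protein sequences.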

Suggested Cloud GPUs

For optimal performance, consider using cloud GPUs such as NVIDIA A100 or V100, available on platforms like AWS, Google Cloud, or Azure.
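As a small sketch, assuming a CUDA-capable GPU is attached, the pipeline can be placed on the GPU via the device argument:

```python
from transformers import pipeline

# device=0 selects the first CUDA GPU (e.g., an A100 or V100 instance);
# use device=-1 to fall back to CPU.
protgpt2 = pipeline("text-generation", model="nferruz/ProtGPT2", device=0)
```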

License

ProtGPT2 is released under the Apache 2.0 license, which permits use and modification of the model under its terms.
