Introduction

ProtBert is a pretrained language model for protein sequences, built on the BERT architecture. It is trained with a masked language modeling (MLM) objective on raw, unlabeled protein sequences, which lets it capture biophysical properties and the "grammar" of protein sequences without human annotation. The model can be used for feature extraction or fine-tuned on specific protein-related tasks.

Architecture

ProtBert uses the BERT architecture with one change to the pretraining setup: each protein sequence is treated as a complete, separate document, so the next sentence prediction task is dropped. During training, 15% of the amino acids in each input are masked, following the standard BERT masking strategy, and the model learns to predict the masked amino acids.
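The masking step can be illustrated with a short sketch. It assumes the 80/10/10 split from the original BERT recipe (80% of selected positions become [MASK], 10% become a random residue, 10% stay unchanged). The function name `mask_residues`, the 20-letter alphabet, and the toy sequence are illustrative assumptions, not the authors' training code.

```python
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # 20 standard residues (assumed alphabet)
MASK_TOKEN = "[MASK]"

def mask_residues(residues, mask_prob=0.15, rng=None):
    """Return (corrupted, labels) for one sequence.

    labels[i] is the original residue at positions selected for
    prediction and None everywhere else.
    """
    if rng is None:
        rng = random.Random()
    corrupted = list(residues)
    labels = [None] * len(residues)
    for i, residue in enumerate(residues):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = residue
        roll = rng.random()
        if roll < 0.8:
            # 80% of selected positions: replace with the mask token
            corrupted[i] = MASK_TOKEN
        elif roll < 0.9:
            # 10%: replace with a different, randomly chosen residue
            corrupted[i] = rng.choice([a for a in AMINO_ACIDS if a != residue])
        # remaining 10%: leave the residue unchanged
    return corrupted, labels

# Toy sequence, for illustration only
sequence = list("MKTAYIAKQRQISFVKSHFSRQ")
corrupted, labels = mask_residues(sequence, rng=random.Random(0))
print(" ".join(corrupted))
```

The model is then trained to recover `labels` at the selected positions. Positions where the label is None do not contribute to the loss.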

Training

Training Data

The model was pretrained on the UniRef100 dataset, which contains 217 million protein sequences. A dataset of this size exposes the model to sequences from a wide variety of proteins and functions.

Training Procedure

  • Preprocessing: Protein sequences are converted to uppercase and tokenized one amino acid per token, with residues separated by single spaces, using a vocabulary size of 21. Rare amino acids (U, Z, O, B) are mapped to 'X'. Each input is wrapped with the special tokens [CLS] and [SEP].
  • Pretraining: Run on a TPU Pod V3-512 for 400k steps in total, in two phases: 300k steps at sequence length 512, followed by 100k steps at sequence length 2048. The optimizer was LAMB, with a learning rate of 0.002 and a weight decay of 0.01. The learning rate was warmed up over the first 40k steps and then decayed linearly.
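The learning rate schedule described above can be sketched as a plain function of the step count. This is a minimal sketch with two assumptions that the text does not state: a single schedule spans both training phases, and the linear decay reaches zero at step 400k. The function name `learning_rate` is hypothetical.

```python
def learning_rate(step, peak_lr=0.002, warmup_steps=40_000, total_steps=400_000):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps.

    Assumes one schedule over all 400k steps and a decay endpoint of zero.
    """
    if step < warmup_steps:
        # Warmup: rise linearly from 0 to peak_lr
        return peak_lr * step / warmup_steps
    # Decay: fall linearly from peak_lr to 0 over the remaining steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)

for step in (0, 20_000, 40_000, 220_000, 400_000):
    print(step, f"{learning_rate(step):.4f}")
```

Under these assumptions the rate peaks at 0.002 at step 40k and is back to half of that value at step 220k, midway through the decay.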

Guide: Running Locally

  1. Install Dependencies: Install the transformers library together with PyTorch, which the model class used below depends on.

    pip install transformers torch
    
  2. Load the Model:

    from transformers import BertForMaskedLM, BertTokenizer, pipeline

    # Amino acid tokens are uppercase, so lowercasing must stay disabled.
    tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
    model = BertForMaskedLM.from_pretrained("Rostlab/prot_bert")
    # The fill-mask pipeline predicts the residue hidden behind [MASK].
    unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
    
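  3. Run Inference: Pass a sequence with residues separated by single spaces and one [MASK] token. The pipeline returns the most likely residues for the masked position, each with a score.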

    unmasker('D L I P T S S K L V V [MASK] D T S L Q V K K A F F A L V T')
    
  4. Cloud GPUs: For intensive tasks, consider using cloud-based GPU services such as AWS, Google Cloud, or Azure to speed up computations.
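The inference call in step 3 expects uppercase residues separated by single spaces. A raw sequence can be brought into that form with a small helper like the one below. This is a sketch rather than part of the official API: the name `prepare_sequence` is hypothetical, and the set of rare residues (U, Z, O, B) is taken from the upstream model card.

```python
import re

def prepare_sequence(raw):
    """Uppercase a raw sequence, map rare residues to X, and space-separate it."""
    # Drop whitespace and line breaks (e.g. from a FASTA body), then uppercase
    cleaned = re.sub(r"\s+", "", raw).upper()
    # Map the rare amino acids U, Z, O, B to X (assumed set, see lead-in)
    cleaned = re.sub(r"[UZOB]", "X", cleaned)
    # One token per residue, separated by single spaces
    return " ".join(cleaned)

print(prepare_sequence("mktayiakqrU"))  # M K T A Y I A K Q R X
```

The special tokens [CLS] and [SEP] do not need to be inserted by hand. The BERT tokenizer adds them when it encodes the spaced string.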

License

ProtBert is released under an open-source license. For the specific terms, including any conditions on research or commercial use, refer to the official repository or contact the authors.
