prot_t5_xl_uniref50
Rostlab
Introduction
ProtT5-XL-UniRef50 is a pretrained protein language model developed by Rostlab. Built on the Transformer-based T5 text-to-text architecture, it treats protein sequences as text and is trained with a masked language modeling (MLM) objective. Its primary use is extracting embeddings (features) from protein sequences for downstream computational biology tasks.
Architecture
ProtT5-XL-UniRef50 is based on the T5-3B model architecture and is trained in a self-supervised fashion. Unlike the original T5 model, which uses a span denoising objective, this model uses a BART-like MLM objective that masks 15% of the amino acids in the input sequences. It retains T5's encoder-decoder architecture and has approximately 3 billion parameters.
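To make the denoising objective concrete, the sketch below shows, in plain Python, how roughly 15% of the residues in a sequence could be selected and replaced by a mask symbol. It is a toy illustration only; the sentinel tokens and corruption logic used in the actual pretraining pipeline are not reproduced here.

```python
import random

def mask_sequence(sequence: str, mask_rate: float = 0.15, mask_token: str = "<mask>"):
    """Toy illustration of MLM-style corruption: replace ~15% of residues with a
    mask symbol and keep the originals as targets. The real ProtT5 pretraining
    uses T5/BART-style sentinel handling, not this exact code."""
    residues = list(sequence)
    n_mask = max(1, round(len(residues) * mask_rate))
    positions = random.sample(range(len(residues)), n_mask)
    targets = {}
    for pos in positions:
        targets[pos] = residues[pos]   # original residue the model must recover
        residues[pos] = mask_token
    return " ".join(residues), targets

corrupted, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(corrupted)
print(targets)
```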
Training
The model was pretrained on the UniRef50 dataset, which consists of 45 million protein sequences. Sequences were preprocessed by uppercasing them and replacing the rare amino acids "U", "Z", "O", and "B" with "X", then tokenized with a vocabulary of size 21. Training was conducted on a TPU Pod V2-256 with a batch size of 2k over 991.5k steps, starting from the ProtT5-XL-BFD checkpoint rather than from scratch. The optimizer was AdaFactor with an inverse square root learning rate schedule.
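As a concrete illustration of this preprocessing, the short sketch below uppercases a sequence, maps the rare amino acids to "X", and splits it into single-residue tokens. The 21-symbol alphabet and its token ids are a hypothetical stand-in, not the model's actual vocabulary file.

```python
import re

# 20 standard amino acids plus "X": a 21-symbol residue alphabet.
# These ids are a hypothetical stand-in for ProtT5's real vocabulary.
ALPHABET = list("ACDEFGHIKLMNPQRSTVWY") + ["X"]
TOKEN_TO_ID = {aa: i for i, aa in enumerate(ALPHABET)}

def preprocess(sequence: str) -> list:
    """Uppercase the sequence and map rare/ambiguous residues (U, Z, O, B) to X."""
    sequence = sequence.upper()
    sequence = re.sub(r"[UZOB]", "X", sequence)
    return list(sequence)

tokens = preprocess("mktayiakqrqisfvkshfsrqleerlglievqB")
ids = [TOKEN_TO_ID[t] for t in tokens]
print(tokens[-3:], ids[-3:])  # the trailing "B" has been mapped to "X"
```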
Guide: Running Locally
To run ProtT5-XL-UniRef50 locally:
- Install dependencies: Ensure you have PyTorch, the Hugging Face Transformers library, and SentencePiece (required by the T5 tokenizer) installed.
- Load the model: Load the tokenizer and the encoder model from the pretrained checkpoint.
- Preprocess sequences: Uppercase each sequence, replace rare amino acids with "X", and separate residues with spaces before tokenizing.
- Generate embeddings: Run the encoder to obtain embeddings for the protein sequences (see the sketch after this list).
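A minimal sketch of these steps is shown below, assuming PyTorch, transformers, and sentencepiece are installed and that the checkpoint is available as Rostlab/prot_t5_xl_uniref50 on the Hugging Face Hub; device placement and half precision are left out for brevity.

```python
import re
import torch
from transformers import T5Tokenizer, T5EncoderModel

model_name = "Rostlab/prot_t5_xl_uniref50"
tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
model = T5EncoderModel.from_pretrained(model_name)
model.eval()

sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "PRTEINO"]
# Uppercase, replace rare amino acids with X, and space-separate residues
# so each amino acid becomes its own token.
sequences = [" ".join(re.sub(r"[UZOB]", "X", seq.upper())) for seq in sequences]

batch = tokenizer(sequences, add_special_tokens=True, padding="longest", return_tensors="pt")
with torch.no_grad():
    outputs = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])

# Per-residue embeddings: (batch, sequence length, hidden size)
per_residue = outputs.last_hidden_state
# A simple per-protein embedding: mean over non-padding positions.
mask = batch["attention_mask"].unsqueeze(-1)
per_protein = (per_residue * mask).sum(dim=1) / mask.sum(dim=1)
print(per_residue.shape, per_protein.shape)
```

Because only embeddings are needed here, the sketch loads the encoder alone (T5EncoderModel) rather than the full encoder-decoder stack, which roughly halves the memory footprint.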
For enhanced performance, consider using cloud GPUs such as AWS EC2 with NVIDIA Tesla V100 or A100 instances.
License
The ProtT5-XL-UniRef50 model is available under the MIT License.