prot_t5_xl_half_uniref50-enc

Rostlab

Introduction

The ProtT5-XL-UniRef50 model is an encoder-only, half-precision protein language model designed for creating protein sequence embeddings. It is part of the ProtTrans project and was pretrained on a large corpus of protein sequences using a self-supervised learning approach.

Architecture

ProtT5-XL-UniRef50 is based on the T5-3B model but was pretrained with a BART-like masked language modeling (MLM) denoising objective rather than the original T5 span-denoising objective; during pretraining, 15% of the amino acids in the input are masked at random. This checkpoint contains only the encoder portion of the model and is stored in half precision (float16), allowing protein- or amino-acid-level representations to be created efficiently.
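As a rough illustration of the masking rate only (this is not the actual pretraining code, and the mask token shown is a placeholder rather than the model's real sentinel token), the sketch below randomly hides 15% of the residues in a sequence:

```python
import random

# Illustrative only: randomly mask 15% of the residues in a protein sequence,
# mirroring the masking rate of the MLM denoising objective described above.
# "<mask>" is a placeholder, not the model's actual sentinel token.
def mask_residues(sequence, prob=0.15, seed=0):
    rng = random.Random(seed)
    return ["<mask>" if rng.random() < prob else aa for aa in sequence]

print(mask_residues("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
```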

Training

The model was pretrained in a self-supervised fashion on a large corpus of publicly available protein sequences, with no human labeling. It was trained on uppercase amino acids only, so input sequences should be uppercased before tokenization (see the sketch below). The encoder-only, half-precision checkpoint generates embeddings with comparatively low GPU memory usage, making it well suited to downstream tasks.
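Because the model only saw uppercase residues during pretraining, inputs are typically uppercased before tokenization. The sketch below assumes the common ProtT5 input convention of mapping rare or ambiguous residues (U, Z, O, B) to X and space-separating amino acids so each residue becomes its own token; treat that exact mapping as an assumption rather than a requirement stated here.

```python
import re

def preprocess(sequences):
    # Uppercase, map ambiguous residues (U, Z, O, B) to X, and insert spaces
    # so the tokenizer treats each amino acid as its own token.
    return [" ".join(list(re.sub(r"[UZOB]", "X", seq.upper()))) for seq in sequences]

print(preprocess(["prteino", "SEQWENCE"]))
# ['P R T E I N X', 'S E Q W E N C E']
```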

Guide: Running Locally

To run this model locally, follow the steps below; a complete code sketch appears after the note.

  1. Installation: Install the necessary packages like transformers and torch.
  2. Model Loading: Load the model with T5EncoderModel.from_pretrained('Rostlab/prot_t5_xl_half_uniref50-enc', torch_dtype=torch.float16).
  3. Tokenization: Tokenize input sequences and pad them using tokenizer.batch_encode_plus.
  4. Embedding Generation: Use the model to generate embeddings by passing input_ids and attention_mask.
  5. GPU Recommendation: It is recommended to use a GPU with at least 8 GB of VRAM. Cloud GPUs such as those available from AWS, Google Cloud, or Azure can be useful.

Note: Half-precision models currently do not run on CPUs. To use the model on a CPU, you must convert it to full precision using model = model.float().
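Putting the steps together, here is a minimal end-to-end sketch. It assumes the transformers and torch packages (the T5 tokenizer also needs sentencepiece), the Rostlab/prot_t5_xl_half_uniref50-enc checkpoint, and sequences preprocessed as described in the Training section; adjust batch sizes and devices to your hardware.

```python
import re
import torch
from transformers import T5Tokenizer, T5EncoderModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Step 2: load the tokenizer and the encoder-only model in half precision.
tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc", do_lower_case=False)
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc",
                                       torch_dtype=torch.float16).to(device)
if device.type == "cpu":
    model = model.float()  # half precision does not run on CPUs (see note above)
model.eval()

# Prepare example sequences: uppercase, rare residues mapped to X, space-separated.
raw_sequences = ["PRTEINO", "SEQWENCE"]
sequences = [" ".join(list(re.sub(r"[UZOB]", "X", s))) for s in raw_sequences]

# Step 3: tokenize and pad the batch.
encoding = tokenizer.batch_encode_plus(sequences, add_special_tokens=True, padding="longest")
input_ids = torch.tensor(encoding["input_ids"]).to(device)
attention_mask = torch.tensor(encoding["attention_mask"]).to(device)

# Step 4: generate embeddings.
with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)

residue_embeddings = outputs.last_hidden_state  # (batch, padded_len, hidden_size)

# Example: a per-protein embedding for the first sequence, averaging over its
# residues and excluding padding/special-token positions.
first_len = len(raw_sequences[0])
protein_embedding = residue_embeddings[0, :first_len].mean(dim=0)
print(residue_embeddings.shape, protein_embedding.shape)
```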

License

The ProtT5-XL-UniRef50 model and its associated resources are part of the ProtTrans project, available in the ProtTrans GitHub repository. Ensure compliance with any license agreements or terms specified in the repository or by Hugging Face when using or distributing this model.
