prot_t5_xl_half_uniref50-enc
Rostlab
Introduction
The ProtT5-XL-UniRef50 model is an encoder-only, half-precision protein language model designed for creating protein sequence embeddings. It is part of the ProtTrans project and was pretrained on a large corpus of protein sequences using a self-supervised learning approach.
Architecture
ProtT5-XL-UniRef50 is based on the T5-3B model but uses a BART-like masked language model (MLM) denoising objective instead of the original T5 span-denoising objective. During pretraining, 15% of the amino acids in each input sequence are randomly masked. This release contains only the encoder portion of the model, stored in half precision (float16), so that protein or amino acid representations can be created efficiently.
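The masking step can be illustrated with a small sketch. This is a simplified illustration, not the ProtTrans training code: the function name mask_residues, the "<mask>" placeholder token, and the choice to mask an exact 15% of positions (rather than sampling each position independently) are all assumptions made for the example.

```python
import random


def mask_residues(sequence, mask_rate=0.15, mask_token="<mask>", seed=None):
    """Randomly replace a fraction of residues with a placeholder token.

    Returns the corrupted token list and a dict mapping each masked
    position to the original residue (the reconstruction target).
    """
    rng = random.Random(seed)
    residues = list(sequence)
    # Mask at least one residue, but never more than the sequence holds.
    n_mask = min(len(residues), max(1, round(mask_rate * len(residues))))
    positions = sorted(rng.sample(range(len(residues)), n_mask))

    corrupted = list(residues)
    for pos in positions:
        corrupted[pos] = mask_token  # hypothetical placeholder token
    targets = {pos: residues[pos] for pos in positions}
    return corrupted, targets


corrupted, targets = mask_residues("MKTAYIAKQRQISFVKSHFS", seed=0)
print(" ".join(corrupted))
print(targets)
```

A denoising objective then trains the model to predict the residues in targets from the corrupted input.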
Training
The model was trained on uppercase amino acids, so input sequences should use uppercase single-letter codes. Pretraining was self-supervised on a large dataset of publicly available protein sequences, with no human labeling. Because only the half-precision encoder is loaded, embeddings for downstream tasks can be generated with comparatively little GPU memory.
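The memory benefit of half precision comes down to simple arithmetic: float16 stores each weight in 2 bytes, float32 in 4. The sketch below estimates weight memory only (activations and framework overhead are ignored). The parameter count is a round hypothetical number chosen for illustration, not a figure taken from the model.

```python
def weight_memory_gib(num_parameters, bytes_per_parameter):
    """Estimate the memory needed to hold the model weights, in GiB."""
    return num_parameters * bytes_per_parameter / 1024**3


# Hypothetical parameter count, used only to show the 2x ratio.
HYPOTHETICAL_PARAMS = 1_000_000_000

fp16 = weight_memory_gib(HYPOTHETICAL_PARAMS, 2)  # float16: 2 bytes per weight
fp32 = weight_memory_gib(HYPOTHETICAL_PARAMS, 4)  # float32: 4 bytes per weight
print(f"float16: {fp16:.2f} GiB, float32: {fp32:.2f} GiB")
```

Whatever the true parameter count, the float16 weights take half the space of their float32 equivalents.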
Guide: Running Locally
To run this model locally:
- Installation: Install the necessary packages, such as transformers and torch.
- Model Loading: Load the model with T5EncoderModel.from_pretrained('Rostlab/prot_t5_xl_half_uniref50-enc', torch_dtype=torch.float16).
- Tokenization: Tokenize input sequences and pad them using tokenizer.batch_encode_plus.
- Embedding Generation: Use the model to generate embeddings by passing input_ids and attention_mask.
- GPU Recommendation: A GPU with at least 8 GB of VRAM is recommended. Cloud GPUs such as those available from AWS, Google Cloud, or Azure can be useful.
Note: Half-precision models currently do not run on CPUs. To use the model on a CPU, you must convert it to full precision using model = model.float().
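The steps above can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not the official usage script: the helper names prepare_sequences and embed are hypothetical, and the preprocessing (mapping the rare residues U, Z, O, B to X and separating residues with spaces) follows the convention described in the upstream model card rather than anything stated in this guide. It also assumes the sentencepiece package is installed, which T5Tokenizer requires.

```python
import re


def prepare_sequences(sequences):
    """Uppercase, map rare residues (U, Z, O, B) to X, and space-separate residues."""
    return [" ".join(re.sub(r"[UZOB]", "X", seq.upper())) for seq in sequences]


def embed(sequences, model_name="Rostlab/prot_t5_xl_half_uniref50-enc"):
    """Return one per-residue embedding tensor per input sequence."""
    # Imported lazily so prepare_sequences can be used without torch installed.
    import torch
    from transformers import T5EncoderModel, T5Tokenizer

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
    model = T5EncoderModel.from_pretrained(model_name, torch_dtype=torch.float16)
    if device.type == "cpu":
        model = model.float()  # fall back to full precision on CPU, per the note above
    model = model.to(device).eval()

    # Tokenize and pad every sequence to the longest one in the batch.
    batch = tokenizer.batch_encode_plus(
        prepare_sequences(sequences), add_special_tokens=True, padding="longest"
    )
    input_ids = torch.tensor(batch["input_ids"]).to(device)
    attention_mask = torch.tensor(batch["attention_mask"]).to(device)

    with torch.no_grad():
        output = model(input_ids=input_ids, attention_mask=attention_mask)

    # Keep one vector per residue: drop padding and the end-of-sequence token.
    hidden = output.last_hidden_state
    return [hidden[i, : len(seq)] for i, seq in enumerate(sequences)]


if __name__ == "__main__":
    # Downloads the model weights on first use.
    per_residue = embed(["PRTEINO", "SEQWENCE"])
    # Averaging over residues gives one fixed-size vector per protein.
    per_protein = [emb.mean(dim=0) for emb in per_residue]
    print([tuple(emb.shape) for emb in per_residue])
```

Averaging the per-residue vectors, as in the usage block, is one common way to obtain a single embedding per protein; other pooling strategies may suit a given downstream task better.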
License
The ProtT5-XL-UniRef50 model and its associated resources are part of the ProtTrans project, which is available in the ProtTrans GitHub repository. When using or distributing this model, comply with the license terms specified in that repository and on the model's Hugging Face page.