prot_t5_xl_uniref50
Rostlab
Introduction
ProtT5-XL-UniRef50 is a pretrained protein language model developed by Rostlab. Built on the Transformer-based T5 text-to-text architecture, it treats protein sequences as text and is trained with a masked language modeling (MLM) objective. Its primary use is extracting embeddings (features) from protein sequences for downstream computational biology tasks.
Architecture
ProtT5-XL-UniRef50 is based on the T5-3B model architecture and is trained in a self-supervised fashion. Unlike the original T5 model, which uses a span denoising objective, this model uses a BART-like MLM objective that masks 15% of the amino acids in the input sequences. It retains T5's encoder-decoder architecture and has approximately 3 billion parameters.
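To make the denoising objective concrete, the sketch below shows, in plain Python, how roughly 15% of the residues in a sequence could be selected and replaced by a mask symbol. It is a toy illustration only; the sentinel tokens and corruption logic used in the actual pretraining pipeline are not reproduced here.

```python
import random

def mask_sequence(sequence: str, mask_rate: float = 0.15, mask_token: str = "<mask>"):
    """Toy illustration of MLM-style corruption: replace ~15% of residues with a
    mask symbol and keep the originals as targets. The real ProtT5 pretraining
    uses T5/BART-style sentinel handling, not this exact code."""
    residues = list(sequence)
    n_mask = max(1, round(len(residues) * mask_rate))
    positions = random.sample(range(len(residues)), n_mask)
    targets = {}
    for pos in positions:
        targets[pos] = residues[pos]   # original residue the model must recover
        residues[pos] = mask_token
    return " ".join(residues), targets

corrupted, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(corrupted)
print(targets)
```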
Training
The model was pretrained on the UniRef50 dataset, which consists of 45 million protein sequences. Sequences were preprocessed by uppercasing them and replacing the rare amino acids "U", "Z", "O", and "B" with "X", then tokenized with a vocabulary of size 21. Training was conducted on a TPU Pod V2-256 with a batch size of 2k over 991.5k steps, starting from the ProtT5-XL-BFD checkpoint rather than from scratch. The optimizer was AdaFactor with an inverse square root learning rate schedule.
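As a concrete illustration of this preprocessing, the short sketch below uppercases a sequence, maps the rare amino acids to "X", and splits it into single-residue tokens. The 21-symbol alphabet and its token ids are a hypothetical stand-in, not the model's actual vocabulary file.

```python
import re

# 20 standard amino acids plus "X": a 21-symbol residue alphabet.
# These ids are a hypothetical stand-in for ProtT5's real vocabulary.
ALPHABET = list("ACDEFGHIKLMNPQRSTVWY") + ["X"]
TOKEN_TO_ID = {aa: i for i, aa in enumerate(ALPHABET)}

def preprocess(sequence: str) -> list:
    """Uppercase the sequence and map rare/ambiguous residues (U, Z, O, B) to X."""
    sequence = sequence.upper()
    sequence = re.sub(r"[UZOB]", "X", sequence)
    return list(sequence)

tokens = preprocess("mktayiakqrqisfvkshfsrqleerlglievqB")
ids = [TOKEN_TO_ID[t] for t in tokens]
print(tokens[-3:], ids[-3:])  # the trailing "B" has been mapped to "X"
```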
Guide: Running Locally
To run ProtT5-XL-UniRef50 locally:
- Install dependencies: Ensure you have PyTorch, the Hugging Face Transformers library, and SentencePiece (required by the T5 tokenizer) installed.
- Load the model: Load the tokenizer and the encoder model from the pretrained checkpoint.
- Preprocess sequences: Uppercase each sequence, replace rare amino acids with "X", and separate residues with spaces before tokenizing.
- Generate embeddings: Run the encoder to obtain embeddings for the protein sequences (see the sketch after this list).
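A minimal sketch of these steps is shown below, assuming PyTorch, transformers, and sentencepiece are installed and that the checkpoint is available as Rostlab/prot_t5_xl_uniref50 on the Hugging Face Hub; device placement and half precision are left out for brevity.

```python
import re
import torch
from transformers import T5Tokenizer, T5EncoderModel

model_name = "Rostlab/prot_t5_xl_uniref50"
tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
model = T5EncoderModel.from_pretrained(model_name)
model.eval()

sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "PRTEINO"]
# Uppercase, replace rare amino acids with X, and space-separate residues
# so each amino acid becomes its own token.
sequences = [" ".join(re.sub(r"[UZOB]", "X", seq.upper())) for seq in sequences]

batch = tokenizer(sequences, add_special_tokens=True, padding="longest", return_tensors="pt")
with torch.no_grad():
    outputs = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])

# Per-residue embeddings: (batch, sequence length, hidden size)
per_residue = outputs.last_hidden_state
# A simple per-protein embedding: mean over non-padding positions.
mask = batch["attention_mask"].unsqueeze(-1)
per_protein = (per_residue * mask).sum(dim=1) / mask.sum(dim=1)
print(per_residue.shape, per_protein.shape)
```

Because only embeddings are needed here, the sketch loads the encoder alone (T5EncoderModel) rather than the full encoder-decoder stack, which roughly halves the memory footprint.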
For enhanced performance, consider using cloud GPUs such as AWS EC2 with NVIDIA Tesla V100 or A100 instances.
License
The ProtT5-XL-UniRef50 model is available under the MIT License.