PlantRNA-FM
yangheng
Introduction
PlantRNA-FM is an RNA foundation model designed to explore functional RNA motifs in plants. RNA sequences and structures play significant roles in plant development and environmental adaptation, so understanding them is central to plant biology. The model integrates extensive RNA sequence and structural data from many plant species, with the aim of predicting RNA function accurately and offering insight into RNA motifs across the plant transcriptome.
Architecture
PlantRNA-FM was built using data from the One Thousand Plant Transcriptomes Project (1KP). Modeling genomic sequences requires careful curation because they follow strict biological patterns. Key data preparation steps were truncating sequences longer than 512 nucleotides, removing sequences shorter than 20 nucleotides as noise, annotating RNA secondary structures, and identifying CDS and UTR sequences. The model is an encoder-only transformer with 35 million parameters: 12 transformer layers, 24 attention heads, and a 480-dimensional embedding. It supports sequences of up to 512 nucleotides and was trained on four A100 GPUs for three weeks.
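The two length rules in the data preparation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function name `curate_sequences` is hypothetical, and keeping the first 512 nucleotides when truncating is an assumption, since the text does not say which end is retained. Structure annotation and CDS/UTR identification are not shown.

```python
from typing import Iterable, List

MAX_LEN = 512  # maximum sequence length stated in the text
MIN_LEN = 20   # sequences shorter than this are treated as noise


def curate_sequences(
    sequences: Iterable[str],
    min_len: int = MIN_LEN,
    max_len: int = MAX_LEN,
) -> List[str]:
    """Drop sequences shorter than `min_len` and truncate the rest to `max_len`.

    Assumption: truncation keeps the leading `max_len` nucleotides.
    """
    curated = []
    for seq in sequences:
        if len(seq) < min_len:
            continue  # too short: filtered out as noise
        curated.append(seq[:max_len])  # no-op for sequences already short enough
    return curated
```

For example, `curate_sequences(["ACGU" * 200, "ACGU"])` would keep only the first sequence, cut to 512 nucleotides.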
Training
Training required Python 3.9+, PyTorch 2.0+, Transformers 4.38+, and pytorch-cuda 11.0+. The pre-training data consisted of RNA sequences from 1,124 plant species, totaling approximately 25.0 million sequences and 54.2 billion RNA bases.
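Before loading the model, it can help to confirm that the local environment meets the minimum versions listed above. The sketch below uses only the standard library. The helper names (`parse_version`, `meets_minimum`, `check_environment`) are hypothetical, and the version parsing is deliberately simple: it compares only the leading numeric components. The pytorch-cuda package is a conda metapackage and is typically not visible to `importlib.metadata`, so it is not checked here.

```python
import re
import sys
from importlib import metadata

# Minimum versions taken from the text; keys are the pip distribution names.
MINIMUMS = {"torch": "2.0", "transformers": "4.38"}


def parse_version(text: str) -> tuple:
    """Return the leading numeric components, e.g. '2.1.0+cu118' -> (2, 1, 0)."""
    numbers = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component such as 'dev0'
        numbers.append(int(match.group()))
    return tuple(numbers)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two version strings after padding them to equal length."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment(minimums: dict = MINIMUMS) -> dict:
    """Map each requirement to True or False; missing packages count as False."""
    report = {"python": sys.version_info >= (3, 9)}
    for package, minimum in minimums.items():
        try:
            report[package] = meets_minimum(metadata.version(package), minimum)
        except metadata.PackageNotFoundError:
            report[package] = False
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing or too old'}")
```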
Guide: Running Locally
- Install Requirements:
  - Ensure Python 3.9+ is installed.
  - Install dependencies: PyTorch 2.0+, Transformers 4.38+, and pytorch-cuda 11.0+ (conda).
- Load Model:

      from transformers import AutoModel, AutoTokenizer

      # Hugging Face Hub identifier of the model
      model_name_or_path = "yangheng/PlantRNA-FM"

      # Load the pretrained encoder and its matching tokenizer
      model = AutoModel.from_pretrained(model_name_or_path)
      tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
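- Perform Inference:

      # Uses `model` and `tokenizer` from the Load Model step
      rna_sequence = 'GCCGACUUAGCUCAGU<mask>GGGAGAGCGUUAGACUGAAGAUCUAAAGGUCCCUGGUUCGAUCCCGGGAGUCGGCACCA'

      # Tokenize the sequence into PyTorch tensors
      inputs = tokenizer(rna_sequence, return_tensors="pt")

      # Run the encoder and print the per-token hidden states
      outputs = model(**inputs)
      print(outputs.last_hidden_state)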
- Cloud GPU Recommendation:
  - Consider using a cloud GPU such as the NVIDIA RTX 4090 for efficient processing.
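The inference example above passes a sequence in which one position has been replaced by a `<mask>` token. A small helper can build such inputs from an unmasked sequence. This is a sketch only: the function name `mask_base` is hypothetical, and the token string `<mask>` is taken from the example above rather than from the tokenizer's documentation, so it is worth confirming against `tokenizer.mask_token` after loading.

```python
MASK_TOKEN = "<mask>"  # token string as it appears in the inference example


def mask_base(sequence: str, index: int, mask_token: str = MASK_TOKEN) -> str:
    """Return `sequence` with the nucleotide at `index` replaced by `mask_token`."""
    if not 0 <= index < len(sequence):
        raise IndexError(f"index {index} is outside a sequence of length {len(sequence)}")
    return sequence[:index] + mask_token + sequence[index + 1:]
```

The returned string can be passed to the tokenizer in the same way as `rna_sequence` in the Perform Inference step.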
License
PlantRNA-FM is distributed under the MIT License. It was developed jointly by ColaLAB at the University of Exeter and the John Innes Centre (JIC) at Norwich Research Park.