Geneformer
Introduction
Geneformer is a foundation transformer model for context-aware predictions in network biology, with a focus on single-cell genomics. It was initially pretrained on a large-scale corpus of single-cell transcriptomes to enable accurate predictions in data-limited scenarios.
Architecture
Geneformer uses a transformer encoder architecture in which each cell's transcriptome is encoded as a rank-value encoding: genes are ranked within the cell by their expression normalized against each gene's expression across the full Genecorpus-30M corpus, which deprioritizes ubiquitously highly expressed genes and gives higher ranks to genes that distinguish cell states. The model stacks multiple transformer encoder layers, with the number of layers depending on the model size. Pretraining uses a masked learning objective: 15% of the genes in each transcriptome are masked, and the model predicts them from the context of the unmasked genes.
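To make the rank-value encoding concrete, below is a minimal sketch (not the library's actual implementation) of how a single cell's expression vector could be converted into a rank-ordered sequence of gene tokens, assuming precomputed per-gene normalization factors such as each gene's non-zero median expression across the corpus; the function name and the input length of 2048 are illustrative.

```python
import numpy as np

def rank_value_encode(expression, gene_ids, gene_norm_factors, max_len=2048):
    """Illustrative sketch: convert one cell's raw expression vector into a
    rank-ordered sequence of gene token ids (highest normalized expression first).

    expression        : raw counts for this cell, one entry per gene
    gene_ids          : integer token id for each gene
    gene_norm_factors : per-gene normalization factors precomputed across the corpus
    """
    expression = np.asarray(expression, dtype=float)
    # Normalize each gene by its corpus-wide factor so ubiquitously
    # high-expression genes do not dominate the ranking.
    normalized = expression / gene_norm_factors
    # Keep only genes detected in this cell.
    detected = np.nonzero(expression)[0]
    # Order detected genes by normalized expression, highest first.
    order = detected[np.argsort(-normalized[detected])]
    # The token sequence is the ranked list of gene ids, truncated to the
    # model's input length.
    return gene_ids[order][:max_len]


# Toy usage with made-up numbers.
expr = np.array([0.0, 5.0, 2.0, 7.0])
ids = np.array([10, 11, 12, 13])
norm = np.array([1.0, 10.0, 1.0, 2.0])
print(rank_value_encode(expr, ids, norm))  # -> [13 12 11]
```

The design rationale is that ranking genes, rather than using absolute expression values, makes the encoding more robust to technical differences such as sequencing depth across datasets.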
Training
Geneformer was initially pretrained on Genecorpus-30M, a corpus of approximately 30 million single-cell transcriptomes. In April 2024, an expanded version was pretrained on approximately 95 million non-cancer transcriptomes and then continually trained on approximately 14 million cancer transcriptomes. This large-scale pretraining gives the model a transferable understanding of gene network dynamics, enabling strong performance on downstream tasks both zero-shot and after fine-tuning with limited task-specific data.
Guide: Running Locally
To run Geneformer locally, the following steps should be taken:
- Install Git LFS: Visit the Git LFS website for installation instructions.
- Clone the repository and install the package:

      git lfs install
      git clone https://huggingface.co/ctheodoris/Geneformer
      cd Geneformer
      pip install .

- Utilize examples: The repository provides usage examples for tasks such as tokenizing transcriptomes, pretraining, fine-tuning, and in silico perturbation (a tokenization sketch follows this list).
- Recommended hardware: GPU resources are essential for efficient processing; cloud GPUs are a practical option if no local GPU is available.
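As a concrete example of the "Utilize examples" step, tokenizing transcriptomes with the installed package typically follows a pattern like the sketch below. The class and method names (`TranscriptomeTokenizer`, `tokenize_data`) follow the repository's usage examples as recalled here and may differ between versions, and the directory paths and metadata mapping are placeholders, so treat this as an illustration rather than the canonical API.

```python
# Sketch of tokenizing single-cell data with the installed geneformer package.
# Names and arguments may vary between package versions; paths are placeholders.
from geneformer import TranscriptomeTokenizer

# Map metadata attributes in the input files to names kept in the tokenized dataset.
tokenizer = TranscriptomeTokenizer({"cell_type": "cell_type"}, nproc=4)

# Convert .loom (or .h5ad) files in the input directory into a tokenized
# dataset written to the output directory.
tokenizer.tokenize_data(
    "path/to/input_data",         # placeholder input directory
    "path/to/tokenized_output",   # placeholder output directory
    "my_dataset",                 # output file prefix
    file_format="loom",
)
```

The tokenized output can then be used with the repository's pretraining, fine-tuning, and in silico perturbation examples.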
License
Geneformer is released under the Apache 2.0 License, allowing for both academic and commercial use.