Introduction

Geneformer is a foundation transformer model for context-aware predictions in network biology, with a focus on single-cell genomics. It was pretrained on a large-scale corpus of single-cell transcriptomes so that it can make accurate predictions in data-limited settings.

Architecture

Geneformer employs a transformer encoder architecture in which each cell's transcriptome is represented by a rank value encoding: genes are ranked by their expression in that cell, normalized by each gene's expression across the entire pretraining corpus (e.g., Genecorpus-30M). This normalization deprioritizes ubiquitously expressed housekeeping genes and ranks genes that distinguish cell states toward the top of the encoding. The model stacks multiple transformer encoder layers, with the number of layers varying by model size. Pretraining uses a masked learning objective: 15% of the genes in each transcriptome are masked, and the model learns to predict each masked gene from the context of the unmasked genes.
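
The encoding step can be illustrated with a short sketch. This is not the packaged Geneformer tokenizer, just a minimal NumPy illustration of rank value encoding, under the assumption that per-gene non-zero median expression values across the corpus are available.

    import numpy as np

    def rank_value_encode(expression, nonzero_medians):
        # Scale each gene by its corpus-wide non-zero median so ubiquitously
        # expressed housekeeping genes are deprioritized relative to genes
        # that distinguish cell states.
        scaled = expression / nonzero_medians
        expressed = np.nonzero(scaled)[0]                  # keep only detected genes
        order = expressed[np.argsort(-scaled[expressed])]  # rank descending by scaled value
        return order                                       # token sequence = ranked gene IDs

    # Gene 2 is modestly expressed but rare corpus-wide, so it outranks gene 0.
    cell = np.array([50.0, 0.0, 5.0, 20.0])
    medians = np.array([100.0, 3.0, 1.0, 10.0])
    print(rank_value_encode(cell, medians))  # -> [2 3 0]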

Training

Geneformer was initially pretrained on Genecorpus-30M, a corpus of approximately 30 million single-cell transcriptomes. In April 2024, an expanded version was pretrained on approximately 95 million non-cancer transcriptomes, followed by continual learning on roughly 14 million cancer transcriptomes to yield a cancer-tuned model. This self-supervised pretraining gives the model a transferable understanding of gene network dynamics, enabling strong performance on downstream tasks both zero-shot and after fine-tuning.
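
The pretrained weights are hosted on the Hugging Face Hub and can be loaded for masked-gene prediction. The sketch below is a minimal, hedged example: it assumes the repository id ctheodoris/Geneformer resolves to a usable checkpoint via the transformers Auto classes (specific model variants may live in subdirectories of the repository), and it uses random gene IDs purely to demonstrate the forward pass.

    import torch
    from transformers import AutoModelForMaskedLM

    # Load the pretrained checkpoint (assumes the repo root holds a default model).
    model = AutoModelForMaskedLM.from_pretrained("ctheodoris/Geneformer")
    model.eval()

    # Tokenized cells are sequences of ranked gene IDs; random IDs are used here
    # only to show the expected input/output shapes.
    dummy_input = torch.randint(low=3, high=1000, size=(1, 256))
    with torch.no_grad():
        logits = model(input_ids=dummy_input).logits  # (batch, seq_len, vocab_size)
    print(logits.shape)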

Guide: Running Locally

To run Geneformer locally, follow these steps:

  1. Install Git LFS:

    • Visit the Git LFS website (git-lfs.com) for installation instructions.
  2. Clone the Repository:

    # install Git LFS hooks so large model files are fetched on clone
    git lfs install
    # clone the model repository from the Hugging Face Hub
    git clone https://huggingface.co/ctheodoris/Geneformer
    cd Geneformer
    # install the geneformer Python package and its dependencies
    pip install .
    
  3. Utilize Examples:

    • Usage examples for tasks such as tokenizing transcriptomes, pretraining, fine-tuning, and in silico perturbation are provided in the repository's examples directory; a tokenization sketch is shown after this list.
  4. Recommended Hardware:

    • Use a GPU for pretraining, fine-tuning, and large-scale inference; cloud GPUs are a convenient option when local GPU resources are unavailable.
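
As referenced in step 3, the sketch below shows how single-cell data in .loom format might be tokenized with the packaged tokenizer. It follows the repository's tokenization example but is hedged: argument names, supported file formats, and defaults may differ across Geneformer releases, and the directory paths are placeholders.

    from geneformer import TranscriptomeTokenizer

    # Map .loom column attributes to keep as per-cell metadata in the tokenized dataset.
    tk = TranscriptomeTokenizer({"cell_type": "cell_type"}, nproc=4)

    # Applies rank value encoding to each cell and writes a Hugging Face dataset
    # of tokenized transcriptomes to the output directory.
    tk.tokenize_data(
        "path/to/loom_data_directory",
        "path/to/output_directory",
        "my_dataset",          # output file prefix
        file_format="loom",
    )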

License

Geneformer is released under the Apache 2.0 License, allowing for both academic and commercial use.
