Introduction

ColBERT (Contextualized Late Interaction over BERT) is a fast and accurate retrieval model that scales BERT-based search to large text collections, returning results in tens of milliseconds. It encodes queries and passages into fine-grained, token-level embeddings and scores them with scalable vector-similarity operators for efficient search.

Architecture

ColBERT's architecture encodes each passage into a matrix of token-level embeddings. At search time, queries are embedded the same way, and passages are scored with the MaxSim operator: each query token embedding is matched against its most similar passage token embedding, and these maximum similarities are summed into the passage's relevance score. This late-interaction approach surpasses single-vector representation models in quality while remaining scalable to large corpora.
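The MaxSim scoring above can be sketched in a few lines of NumPy: compute all pairwise token similarities, keep the maximum per query token, and sum. This is an illustrative re-implementation under assumed matrix shapes, not ColBERT's optimized code; the `normalize` helper and random embeddings are stand-ins for real encoder output.

```python
import numpy as np

def normalize(X):
    """L2-normalize each row so dot products become cosine similarities."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def maxsim_score(Q, D):
    """Late-interaction relevance: for each query token, take the max
    similarity against any passage token, then sum over query tokens.
    Q: (num_query_tokens, dim), D: (num_passage_tokens, dim)."""
    sim = Q @ D.T                      # all pairwise token similarities
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
Q = normalize(rng.standard_normal((4, 8)))    # 4 query tokens, dim 8
D = normalize(rng.standard_normal((10, 8)))   # 10 passage tokens
print(maxsim_score(Q, D))
```

Because embeddings are row-normalized, a passage containing exact copies of all query tokens scores the maximum possible value (one per query token).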

Training

ColBERT provides a pre-trained model checkpoint but also supports training from scratch. Training requires a JSONL triples file in which each line holds a query ID, a positive passage ID, and a negative passage ID. Training can be distributed across multiple GPUs to speed it up.

Guide: Running Locally

  1. Environment Setup: Install Python 3.7+, PyTorch 1.9+, and the Hugging Face Transformers library, then create a conda environment:

    conda env create -f conda_env[_cpu].yml
    conda activate colbert
    

    Note: conda_env_cpu.yml sets up a CPU-only environment; a GPU is required for training and indexing.

  2. Preprocessing: Store passages and queries as TSV files that map integer IDs to text.
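A collection TSV pairs an integer passage ID with its text, one tab-separated record per line (queries follow the same shape). The IDs and text below are placeholders:

```python
import csv

# Placeholder passages; real collections pair an integer ID with passage text.
passages = [(0, "ColBERT scales BERT-based search to large collections."),
            (1, "Late interaction compares token-level embeddings.")]

with open("collection.tsv", "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(passages)

# Reading the TSV back into (id, text) pairs:
with open("collection.tsv") as f:
    rows = [(int(pid), text) for pid, text in csv.reader(f, delimiter="\t")]
print(rows)
```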

  3. Download Pre-trained Checkpoint: Obtain the ColBERTv2 checkpoint for initial setup.

  4. Indexing: Encode and store passage matrices for efficient retrieval. Use the following code for indexing:

    from colbert.infra import Run, RunConfig, ColBERTConfig
    from colbert import Indexer
    
    if __name__ == '__main__':
        with Run().context(RunConfig(nranks=1, experiment="msmarco")):
            config = ColBERTConfig(nbits=2, root="/path/to/experiments")
            indexer = Indexer(checkpoint="/path/to/checkpoint", config=config)
            indexer.index(name="msmarco.nbits=2", collection="/path/to/MSMARCO/collection.tsv")
    
  5. Searching: Retrieve top-k passages from the collection using queries:

    from colbert.data import Queries
    from colbert.infra import Run, RunConfig, ColBERTConfig
    from colbert import Searcher
    
    if __name__ == '__main__':
        with Run().context(RunConfig(nranks=1, experiment="msmarco")):
            config = ColBERTConfig(root="/path/to/experiments")
            searcher = Searcher(index="msmarco.nbits=2", config=config)
            queries = Queries("/path/to/MSMARCO/queries.dev.small.tsv")
            ranking = searcher.search_all(queries, k=100)
            ranking.save("msmarco.nbits=2.ranking.tsv")
    
  6. Cloud GPUs: For better performance, consider using cloud GPUs like Google Colab, which offers free T4 GPUs.

License

ColBERT is licensed under the MIT License, allowing for open-source use and modification.