colbert-ir/colbertv2.0
Introduction
ColBERT is a fast and accurate retrieval model that enables scalable BERT-based search over large text collections in milliseconds. It encodes passages and queries into fine-grained, token-level contextual embeddings and scores them with scalable vector-similarity operators (late interaction) for efficient search.
Architecture
ColBERT's architecture encodes each passage into a matrix of token-level embeddings. At search time, queries are embedded the same way, and passages are scored against the query with the MaxSim operator: each query token embedding is matched with its most similar passage token embedding, and these maximum similarities are summed. This approach surpasses single-vector representation models in quality while remaining scalable to large corpora.
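The MaxSim scoring described above can be sketched in a few lines of NumPy. This is illustrative only: the embedding dimension (128), the token counts, and the random vectors are placeholders, not output from the actual ColBERT encoder.

```python
import numpy as np

def maxsim_score(Q, D):
    """Late-interaction MaxSim: for each query token embedding, take its
    maximum similarity over all passage token embeddings, then sum over
    query tokens. Q: (n_query_tokens, dim), D: (n_doc_tokens, dim)."""
    sim = Q @ D.T                      # pairwise token similarities
    return float(sim.max(axis=1).sum())

def normalize(X):
    # L2-normalize rows so dot products are cosine similarities
    return X / np.linalg.norm(X, axis=1, keepdims=True)

rng = np.random.default_rng(0)
Q = normalize(rng.standard_normal((32, 128)))   # stand-in query embeddings
D = normalize(rng.standard_normal((180, 128)))  # stand-in passage embeddings
score = maxsim_score(Q, D)
```

With normalized embeddings, each query token contributes at most 1 to the score, so the score is bounded by the number of query tokens; scoring a passage against itself attains that bound.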
Training
ColBERT provides a pre-trained model checkpoint but also supports training from scratch. Training requires a JSONL triples file consisting of query ID, positive passage ID, and negative passage ID. The training process can be distributed across multiple GPUs to optimize performance.
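As a small illustration of the triples format described above, each JSONL line holds a query ID, a positive passage ID, and a negative passage ID. The IDs below are made up for the example, and the exact schema may vary by ColBERT version:

```python
import json
import os
import tempfile

# Hypothetical triples: [query_id, positive_passage_id, negative_passage_id]
triples = [
    [3, 12, 45],
    [7, 88, 19],
]

path = os.path.join(tempfile.gettempdir(), "triples.example.jsonl")
with open(path, "w") as f:
    for triple in triples:
        f.write(json.dumps(triple) + "\n")  # one JSON array per line
```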
Guide: Running Locally
- Environment Setup: Install Python 3.7+ and PyTorch 1.9+, along with the Hugging Face Transformers library. Create and activate a conda environment:
conda env create -f conda_env[_cpu].yml
conda activate colbert
Note: A GPU is required for training and indexing.
- Preprocessing: Use TSV files for passages and queries.
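A minimal sketch of the expected TSV layout, assuming the ID-tab-text convention used in the ColBERT repository's MS MARCO examples (the file names and contents here are placeholders):

```python
import os
import tempfile

# Assumed layout: collection.tsv -> "pid<TAB>passage",
#                 queries.tsv    -> "qid<TAB>query"
passages = [
    (0, "ColBERT encodes each passage into a matrix of token embeddings."),
    (1, "MaxSim sums each query token's maximum similarity over a passage."),
]
queries = [(0, "what is late interaction in colbert")]

tmp = tempfile.gettempdir()
with open(os.path.join(tmp, "collection.tsv"), "w") as f:
    for pid, text in passages:
        f.write(f"{pid}\t{text}\n")
with open(os.path.join(tmp, "queries.tsv"), "w") as f:
    for qid, text in queries:
        f.write(f"{qid}\t{text}\n")
```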
- Download Pre-trained Checkpoint: Obtain the ColBERTv2 checkpoint for initial setup.
- Indexing: Encode and store the passage embedding matrices for efficient retrieval. Use the following code for indexing:
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Indexer

if __name__ == '__main__':
    with Run().context(RunConfig(nranks=1, experiment="msmarco")):
        config = ColBERTConfig(nbits=2, root="/path/to/experiments")
        indexer = Indexer(checkpoint="/path/to/checkpoint", config=config)
        indexer.index(name="msmarco.nbits=2", collection="/path/to/MSMARCO/collection.tsv")
- Searching: Retrieve the top-k passages from the collection for each query:
from colbert.data import Queries
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Searcher

if __name__ == '__main__':
    with Run().context(RunConfig(nranks=1, experiment="msmarco")):
        config = ColBERTConfig(root="/path/to/experiments")
        searcher = Searcher(index="msmarco.nbits=2", config=config)
        queries = Queries("/path/to/MSMARCO/queries.dev.small.tsv")
        ranking = searcher.search_all(queries, k=100)
        ranking.save("msmarco.nbits=2.ranking.tsv")
- Cloud GPUs: For better performance, consider using cloud GPUs such as the free T4 GPUs offered by Google Colab.
License
ColBERT is licensed under the MIT License, allowing for open-source use and modification.