facebook/contriever-msmarco
Introduction
Contriever-MSMARCO is a fine-tuned version of the pre-trained Contriever model developed by Meta AI. It follows the approach described in the paper "Towards Unsupervised Dense Information Retrieval with Contrastive Learning." The model is used for feature extraction: it produces text embeddings suitable for dense information retrieval.
Architecture
The model is loaded through the Hugging Face Transformers library and runs on PyTorch. It produces one embedding per sentence by mean pooling the token embeddings, with padding positions excluded through the attention mask. It is intended for feature extraction and is compatible with various inference endpoints.
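To make the pooling arithmetic concrete, here is a minimal sketch in plain Python, using nested lists instead of tensors. The helper name `masked_mean` and the toy numbers are illustrative assumptions, not part of the model's code; the point is only that padded positions (mask value 0) do not contribute to the average.

```python
def masked_mean(token_embeddings, mask):
    """Average the token vectors of each sentence, skipping padded positions.

    Hypothetical helper for illustration: token_embeddings is a nested list
    [sentence][token][dimension], mask is a nested list [sentence][token] of 0/1.
    """
    sentence_embeddings = []
    for tokens, token_mask in zip(token_embeddings, mask):
        dim = len(tokens[0])
        # Keep only the vectors of real (non-padding) tokens.
        kept = [vec for vec, m in zip(tokens, token_mask) if m]
        sentence_embeddings.append(
            [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]
        )
    return sentence_embeddings


# Toy batch: 2 sentences, 3 token positions, 2-dimensional token vectors.
token_embeddings = [
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],  # third position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],  # no padding
]
mask = [
    [1, 1, 0],
    [1, 1, 1],
]

pooled = masked_mean(token_embeddings, mask)
print(pooled)  # [[2.0, 3.0], [4.0, 2.0]]
```

The first sentence averages only its two real tokens, so the padding vector `[9.0, 9.0]` has no effect on the result.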
Training
The Contriever-MSMARCO model is fine-tuned on MS MARCO, a large-scale dataset for training information retrieval models. The pre-trained Contriever model serves as the base and receives this additional training to improve its performance on retrieval tasks.
Guide: Running Locally
To run the Contriever-MSMARCO model locally, follow these steps:
- Install Dependencies: With Python available, install PyTorch and the Hugging Face Transformers library.
    pip install torch transformers
- Load the Model and Tokenizer:
    import torch
    from transformers import AutoTokenizer, AutoModel
    tokenizer = AutoTokenizer.from_pretrained('facebook/contriever-msmarco')
    model = AutoModel.from_pretrained('facebook/contriever-msmarco')
- Prepare Input Data:
    sentences = [
        "Where was Marie Curie born?",
        "Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.",
        "Born in Paris on 15 May 1859, Pierre Curie was the son of Eugène Curie, a doctor of French Catholic origin from Alsace."
    ]
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
- Compute Embeddings:
    # Average the token embeddings of each sentence, ignoring padding positions.
    def mean_pooling(token_embeddings, mask):
        token_embeddings = token_embeddings.masked_fill(~mask[..., None].bool(), 0.)
        sentence_embeddings = token_embeddings.sum(dim=1) / mask.sum(dim=1)[..., None]
        return sentence_embeddings
    # Inference only, so gradient tracking is switched off.
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = mean_pooling(outputs[0], inputs['attention_mask'])
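Once embeddings are available, a typical retrieval step is to score each passage against the query and sort by score. The sketch below assumes dot-product scoring and works on plain Python lists; the helper names `dot` and `rank_passages` and the toy vectors are illustrative assumptions, not part of the model's code.

```python
def dot(u, v):
    """Dot product of two equal-length vectors given as lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def rank_passages(query_embedding, passage_embeddings):
    """Return (passage_index, score) pairs, highest score first.

    Hypothetical helper for illustration; assumes dot-product similarity.
    """
    scores = [dot(query_embedding, p) for p in passage_embeddings]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]


# Toy 3-dimensional vectors standing in for real sentence embeddings.
query = [1.0, 0.0, 1.0]
passages = [
    [0.5, 0.5, 1.0],
    [0.0, 1.0, 0.0],
]

ranking = rank_passages(query, passages)
print(ranking)  # [(0, 1.5), (1, 0.0)]
```

With the `embeddings` tensor from the steps above, where the first sentence is the query, the same helper could be called as `vecs = embeddings.detach().tolist()` followed by `rank_passages(vecs[0], vecs[1:])`. On tensors, a single score can also be computed directly as `embeddings[0] @ embeddings[1]`.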
For faster inference, especially when embedding large collections, consider running the model on a GPU, for example a cloud instance from providers like AWS, Google Cloud, or Azure.
License
The Contriever-MSMARCO model and its associated resources are available under the licensing terms specified by Meta AI and Hugging Face, accessible through their respective repositories and platforms.