Contriever-MSMARCO Model

Introduction

Contriever-MSMARCO is a fine-tuned version of the pre-trained Contriever model, developed by Meta AI. It follows the approach described in the paper "Towards Unsupervised Dense Information Retrieval with Contrastive Learning." The model extracts features from text in the form of embeddings, which makes it suitable for dense information retrieval.

Architecture

The model is loaded through the Hugging Face Transformers library and runs on PyTorch. It applies mean pooling over the token embeddings to produce one embedding per sentence: the vectors of all non-padding tokens are averaged. The model is intended for feature extraction and can also be served through hosted inference endpoints.
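To make the mean pooling step concrete, here is a minimal sketch in plain Python on a toy sequence. The function name `masked_mean_pool` and the input values are illustrative assumptions, not part of the model's code; the sketch also assumes each sequence has at least one real token.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sequence, skipping padded positions.

    token_embeddings: list of per-token vectors (lists of floats).
    attention_mask: list of 0/1 flags, 1 for a real token, 0 for padding.
    Assumes at least one position has mask value 1.
    """
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    return [value / count for value in total]


# Toy sequence: two real tokens followed by one padding position.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector `[9.0, 9.0]` does not affect the result because its mask value is 0.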

Training

The Contriever-MSMARCO model is fine-tuned on MS MARCO, a large-scale dataset for training information retrieval models. The pre-trained Contriever model serves as the base and receives additional training to improve its performance on retrieval tasks.

Guide: Running Locally

To run the Contriever-MSMARCO model locally, follow these steps:

Install Dependencies: Ensure that you have Python and PyTorch installed. Also, install the Transformers library from Hugging Face.
```
pip install torch transformers
```

Load the Model and Tokenizer:

```
import torch
from transformers import AutoTokenizer, AutoModel

# Download (or load from the local cache) the tokenizer and model weights
tokenizer = AutoTokenizer.from_pretrained('facebook/contriever-msmarco')
model = AutoModel.from_pretrained('facebook/contriever-msmarco')
```

Prepare Input Data:

```
# One query followed by two candidate passages
sentences = [
    "Where was Marie Curie born?",
    "Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.",
    "Born in Paris on 15 May 1859, Pierre Curie was the son of Eugène Curie, a doctor of French Catholic origin from Alsace."
]
# Pad to the longest sentence, truncate over-long inputs, and return PyTorch tensors
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
```

Compute Embeddings:

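```
# Run the encoder without tracking gradients; outputs[0] holds the token embeddings
with torch.no_grad():
    outputs = model(**inputs)

def mean_pooling(token_embeddings, mask):
    # Zero out padding positions so they do not contribute to the sum
    token_embeddings = token_embeddings.masked_fill(~mask[..., None].bool(), 0.)
    # Divide by the number of real tokens in each sequence
    sentence_embeddings = token_embeddings.sum(dim=1) / mask.sum(dim=1)[..., None]
    return sentence_embeddings

# One embedding vector per input sentence
embeddings = mean_pooling(outputs[0], inputs['attention_mask'])
```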
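Once embeddings are available, a query can be compared against candidate passages. The sketch below assumes dot-product scoring, which is common for dense retrievers; the helper names `dot` and `rank_passages` and the toy 3-dimensional vectors are hypothetical stand-ins for real embeddings, written in plain Python so the arithmetic is visible.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_passages(query_vec, passage_vecs):
    """Return (index, score) pairs sorted from highest to lowest score."""
    scores = [(i, dot(query_vec, p)) for i, p in enumerate(passage_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for one query embedding and two passage embeddings.
query = [1.0, 0.0, 1.0]
passages = [[0.5, 0.5, 0.0], [1.0, 0.0, 2.0]]
ranking = rank_passages(query, passages)
print(ranking)  # [(1, 3.0), (0, 0.5)]
```

With the tensors from the steps above, the equivalent score for the first passage would be `embeddings[0] @ embeddings[1]`, under the same dot-product assumption.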

The model runs on a CPU, but a GPU speeds up embedding large collections. If no local GPU is available, cloud GPU instances from providers such as AWS, Google Cloud, or Azure are an option.

License

The Contriever-MSMARCO model and its associated resources are available under the licensing terms specified by Meta AI and Hugging Face, accessible through their respective repositories and platforms.
