msmarco-MiniLM-L6-en-de-v1

cross-encoder

Introduction

cross-encoder/msmarco-MiniLM-L6-en-de-v1 is a cross-lingual Cross-Encoder model for English and German, designed for passage re-ranking. It was trained on the MS MARCO Passage Ranking dataset and is intended for information retrieval: given a query and a set of candidate passages, it scores each query-passage pair so that the candidates can be re-ordered by relevance.

Architecture

The model is built on MiniLM, a compact and efficient language model architecture. As a Cross-Encoder, it reads the query and the passage together as a single input and outputs one relevance score per pair. Its small size favors high throughput while keeping relevance ranking accurate, which makes it suitable for large-scale information retrieval tasks.

Training

The model was trained on the MS MARCO dataset, focusing on the passage ranking task. It was evaluated on TREC-DL19 (both EN-EN and DE-EN) and GermanDPR (DE-DE) using NDCG@10 and MRR@10. On these benchmarks it achieved competitive scores and significantly outperformed the BM25 baseline.
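
To make the two metrics concrete, here is a minimal sketch of how MRR@10 and NDCG@10 can be computed for a single query. It is not the evaluation script used for this model. It assumes linear gain for NDCG, and it builds the ideal ordering from the same ranked list, whereas official tooling uses all judged passages for the query. The function names and the sample relevance labels are illustrative only.

```python
import math

def mrr_at_k(ranked_relevance, k=10):
    """Reciprocal rank of the first relevant item in the top k (0.0 if none)."""
    for position, rel in enumerate(ranked_relevance[:k], start=1):
        if rel > 0:
            return 1.0 / position
    return 0.0

def dcg_at_k(ranked_relevance, k=10):
    """Discounted cumulative gain with linear gain (an assumption of this sketch)."""
    return sum(
        rel / math.log2(position + 1)
        for position, rel in enumerate(ranked_relevance[:k], start=1)
    )

def ndcg_at_k(ranked_relevance, k=10):
    """DCG divided by the DCG of the best possible ordering of the same labels."""
    ideal = dcg_at_k(sorted(ranked_relevance, reverse=True), k)
    return dcg_at_k(ranked_relevance, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance labels, listed in the order the model ranked the passages.
ranked_relevance = [0, 2, 1, 0]
print(mrr_at_k(ranked_relevance), ndcg_at_k(ranked_relevance))
```

The reported figure for a benchmark is the mean of these per-query values over all queries.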

Guide: Running Locally

To run the model locally, you can use either the SentenceTransformers or Transformers library.

Using SentenceTransformers:

from sentence_transformers import CrossEncoder
# Load the cross-encoder; inputs longer than 512 tokens are truncated
model = CrossEncoder('cross-encoder/msmarco-MiniLM-L6-en-de-v1', max_length=512)
query = 'How many people live in Berlin?'
docs = [
    'Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.',
    'New York City is famous for the Metropolitan Museum of Art.',
]
# Score every (query, passage) pair; a higher score means more relevant
pairs = [(query, doc) for doc in docs]
scores = model.predict(pairs)
print(scores)
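
The scores only become a ranking once the passages are sorted by them. The sketch below shows that step in plain Python. The `rerank` helper is hypothetical, not part of SentenceTransformers, and the score values are made-up placeholders standing in for the output of `model.predict`.

```python
def rerank(docs, scores):
    """Return (doc, score) pairs sorted from most to least relevant."""
    if len(docs) != len(scores):
        raise ValueError("docs and scores must have the same length")
    paired = zip(docs, (float(s) for s in scores))
    return sorted(paired, key=lambda pair: pair[1], reverse=True)

docs = [
    'New York City is famous for the Metropolitan Museum of Art.',
    'Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.',
]
# Placeholder scores for illustration only; real values come from model.predict(pairs).
scores = [-4.2, 8.1]

for doc, score in rerank(docs, scores):
    print(f"{score:6.2f}  {doc}")
```

Converting each score with `float` lets the helper accept either a plain list or the array returned by the model.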

Using Transformers:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained('cross-encoder/msmarco-MiniLM-L6-en-de-v1')
tokenizer = AutoTokenizer.from_pretrained('cross-encoder/msmarco-MiniLM-L6-en-de-v1')

queries = ['How many people live in Berlin?', 'How many people live in Berlin?']
passages = ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.', 'New York City is famous for the Metropolitan Museum of Art.']
# Tokenize each query together with its passage as one paired input
features = tokenizer(queries, passages, padding=True, truncation=True, return_tensors="pt")

model.eval()
with torch.no_grad():
    scores = model(**features).logits
    print(scores)

Cloud GPU Recommendation:

For best throughput, especially when re-ranking large candidate sets, run the model on a GPU. A cloud GPU such as the NVIDIA V100 can score a high number of query-document pairs per second.
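
When the candidate set is too large to score in one call, the pairs can be fed to the model in fixed-size batches. The following is a minimal sketch of that pattern. The `batched` and `score_in_batches` helpers are hypothetical, and `score_fn` is a stand-in for whatever scoring call is used, for example `model.predict`. The batch size of 32 is an arbitrary assumption to tune for the available GPU memory.

```python
from itertools import islice

def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def score_in_batches(pairs, score_fn, batch_size=32):
    """Apply score_fn to each batch and concatenate the results in order."""
    scores = []
    for batch in batched(pairs, batch_size):
        scores.extend(score_fn(batch))
    return scores

# Stand-in scorer for illustration: it returns the passage length, not a relevance score.
def dummy_score_fn(batch):
    return [len(passage) for _, passage in batch]

pairs = [("query", "passage " * n) for n in range(1, 6)]
print(score_in_batches(pairs, dummy_score_fn, batch_size=2))
```

Because `batched` consumes an iterator, the pairs can also come from a generator, so the full candidate set never has to be held in memory at once.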

License

This model is licensed under the Apache-2.0 License, which permits use, modification, and distribution subject to the conditions of the license.
