msmarco-MiniLM-L6-en-de-v1
cross-encoder
Introduction
The cross-encoder/msmarco-MiniLM-L6-en-de-v1 model is a cross-lingual Cross-Encoder for English and German, designed for passage re-ranking. It is trained on the MS MARCO Passage Ranking dataset and is intended for information retrieval: given a query, it scores candidate passages by relevance so that they can be re-ranked.
Architecture
This model is built on MiniLM, a compact and efficient Transformer language model. As a Cross-Encoder, it reads the query and a passage together as a single input and outputs one relevance score for the pair. The small architecture favors high throughput while keeping relevance ranking accurate, which makes it suitable for large-scale information retrieval tasks.
Training
The model was trained on the MS MARCO dataset for the passage ranking task. It was evaluated on TREC-DL19 (both EN-EN and DE-EN) and on GermanDPR (DE-DE), with NDCG@10 and MRR@10 as the effectiveness metrics. On these benchmarks the model achieves competitive scores and significantly outperforms the BM25 baseline.
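To make the two metrics concrete, here is a minimal sketch of how MRR@10 and NDCG@10 can be computed for a single query. It is not the evaluation script used for this model. The function names are illustrative, the input is assumed to be the list of relevance labels of the returned passages in ranked order, and NDCG is assumed to use linear gain with the ideal ranking derived from that same list.

```python
import math

def mrr_at_k(relevances, k=10):
    """Reciprocal rank of the first relevant passage within the top k (0.0 if none)."""
    for rank, rel in enumerate(relevances[:k], start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain with linear gain (an assumption of this sketch)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the best possible ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical labels for one query: the only relevant passage is ranked third.
labels = [0, 0, 1, 0]
print(mrr_at_k(labels), ndcg_at_k(labels))
```

Benchmark scores are then the mean of these per-query values over all queries.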
Guide: Running Locally
To run the model locally, you can use either the SentenceTransformers or Transformers library.
Using SentenceTransformers:
from sentence_transformers import CrossEncoder

# Load the re-ranker; inputs longer than 512 tokens are truncated
model = CrossEncoder('cross-encoder/msmarco-MiniLM-L6-en-de-v1', max_length=512)
query = 'How many people live in Berlin?'
docs = ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.', 'New York City is famous for the Metropolitan Museum of Art.']
# Score each (query, passage) pair; a higher score means more relevant
pairs = [(query, doc) for doc in docs]
scores = model.predict(pairs)
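The scores are returned in the same order as the input pairs, so re-ranking means sorting the passages by score. The sketch below shows one way to do that. The `rerank` helper and the example score values are illustrative assumptions, not part of the library and not real model output.

```python
def rerank(docs, scores):
    """Return (doc, score) tuples sorted from most to least relevant."""
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [(doc, float(score)) for doc, score in ranked]

# Hypothetical scores standing in for the output of model.predict(pairs)
example_docs = ['New York City is famous for the Metropolitan Museum of Art.',
                'Berlin has a population of 3,520,031 registered inhabitants.']
example_scores = [-4.2, 8.1]

for doc, score in rerank(example_docs, example_scores):
    print(f'{score:6.2f}  {doc}')
```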
Using Transformers:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained('cross-encoder/msmarco-MiniLM-L6-en-de-v1')
tokenizer = AutoTokenizer.from_pretrained('cross-encoder/msmarco-MiniLM-L6-en-de-v1')
# First list holds the queries, second list the passages; entries are paired by position
features = tokenizer(['How many people live in Berlin?', 'How many people live in Berlin?'], ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.', 'New York City is famous for the Metropolitan Museum of Art.'], padding=True, truncation=True, return_tensors="pt")
model.eval()  # switch to inference mode (disables dropout)
with torch.no_grad():  # no gradients are needed for scoring
    scores = model(**features).logits  # one relevance logit per pair
    print(scores)
Cloud GPU Recommendation:
When re-ranking large collections, run the model on a GPU. A cloud GPU such as the NVIDIA V100 can score a high number of query-document pairs per second; the actual rate depends on the hardware, the sequence length, and the batch size.
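Because throughput varies with hardware and batch size, it can help to measure it on your own setup. The following is a rough sketch of such a measurement. The `pairs_per_second` helper and the `dummy_score_fn` stand-in are hypothetical names introduced here; in practice you would pass a real scoring callable, for example the `predict` method of a loaded CrossEncoder.

```python
import time

def pairs_per_second(score_fn, pairs, batch_size=32):
    """Time score_fn over all pairs, fed in batches, and return pairs per second."""
    start = time.perf_counter()
    for i in range(0, len(pairs), batch_size):
        score_fn(pairs[i:i + batch_size])
    elapsed = time.perf_counter() - start
    return len(pairs) / elapsed if elapsed > 0 else float('inf')

def dummy_score_fn(batch):
    """Stand-in scorer (an assumption): returns one float per (query, doc) pair."""
    return [float(len(query) + len(doc)) for query, doc in batch]

sample_pairs = [('How many people live in Berlin?', 'Berlin is a city.')] * 1000
print(f'{pairs_per_second(dummy_score_fn, sample_pairs):.0f} pairs/sec')
```

A single timed run is noisy, so repeat the measurement a few times and discard the first run, which often includes one-time setup cost.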
License
This model is licensed under the Apache-2.0 License, which permits use, modification, and redistribution, provided the license's conditions (such as retaining the license and copyright notices) are met.