BERT Multilingual Passage Reranking (MS MARCO)
Introduction
The BERT Multilingual Passage Reranking model by Amberoad improves search result relevance by reranking candidate passages according to how well they match a given query. Built on a multilingual version of BERT, it supports over 100 languages and is optimized for passage reranking tasks, in particular improving Elasticsearch search results.
Architecture
The model architecture is BERT with a densely connected neural network layer added on top. This layer takes the 768-dimensional [CLS] token embedding and outputs a single score between -10 and 10 indicating how relevant the passage is to the query. The approach is documented in the associated arXiv paper.
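For intuition, here is a minimal PyTorch sketch of the head described above: a BERT backbone whose [CLS] embedding is passed through one linear layer to yield a relevance score. The `RerankingHead` class name and the `bert-base-multilingual-uncased` backbone are illustrative assumptions, not the exact layers shipped in the checkpoint.

```python
import torch.nn as nn
from transformers import AutoModel

class RerankingHead(nn.Module):
    """Illustrative sketch of a BERT reranking head (not the checkpoint's exact layout)."""

    def __init__(self, model_name="bert-base-multilingual-uncased"):  # assumed backbone
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)          # BERT backbone
        self.score = nn.Linear(self.encoder.config.hidden_size, 1)    # 768 -> 1 score

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = outputs.last_hidden_state[:, 0]   # 768-dim [CLS] token embedding
        return self.score(cls).squeeze(-1)      # single relevance score per pair
```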
Training
The model was trained on the Microsoft MS MARCO dataset, which contains roughly 400 million tuples of queries paired with relevant and non-relevant passages. Training ran for 400,000 steps, about 12 hours on a TPU v3-8, starting from the multilingual uncased BERT base model. Evaluation showed performance comparable to English-only models, with particularly strong accuracy in German.
Guide: Running Locally
- Install Dependencies: Ensure you have Python, PyTorch, and the Hugging Face Transformers library installed (e.g. `pip install transformers torch`).
- Load Model and Tokenizer:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("amberoad/bert-multilingual-passage-reranking-msmarco")
model = AutoModelForSequenceClassification.from_pretrained("amberoad/bert-multilingual-passage-reranking-msmarco")
```
- Inference: Use the model to score query-passage pairs and rerank passages by score; see the sketch after this list.
- Consider Cloud GPUs: For optimal performance, consider using cloud services like AWS or Google Cloud to access GPUs, particularly for large-scale inference tasks.
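As a concrete example, the following sketch reranks two passages for one query. It assumes the checkpoint follows the common two-logit sequence-classification layout for MS MARCO rerankers (index 1 = relevant); if the model instead emits a single score, rank by the raw logit. The query and passages are made up for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "amberoad/bert-multilingual-passage-reranking-msmarco"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

query = "how do solar panels work"
passages = [
    "Solar panels convert sunlight into electricity using photovoltaic cells.",
    "The stock market closed higher on Friday.",
]

# Encode each (query, passage) pair as a single sequence-pair input.
inputs = tokenizer([query] * len(passages), passages,
                   padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Assuming two logits per pair (non-relevant, relevant), softmax gives a
# relevance probability; rank passages by the "relevant" probability.
scores = torch.softmax(logits, dim=-1)[:, 1]
for score, passage in sorted(zip(scores.tolist(), passages), reverse=True):
    print(f"{score:.3f}  {passage}")
```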
License
The BERT Multilingual Passage Reranking model is licensed under the Apache 2.0 License, allowing for broad use and modification.