msmarco-spanish-mt5-base-v1

Introduction
The msmarco-spanish-mt5-base-v1 model is a doc2query model based on mT5, designed for document expansion and for generating domain-specific training data. It generates Spanish queries for a given paragraph, which can be used to improve search engine performance and to create training data for embedding models.
Architecture
The model is built on Google's mT5 architecture and fine-tuned with MS MARCO data. It uses seq2seq learning to produce diverse queries from given text inputs. The model supports both beam-search for high-quality queries and top_k sampling for diverse outputs.
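To make the two decoding modes concrete: beam search keeps the highest-scoring candidate sequences, while top_k sampling draws each token at random from only the k most probable choices, trading some quality for diversity. A minimal, self-contained sketch of the top-k idea on a toy logit vector (not the model's actual decoder):

```python
import math
import random

def top_k_sample(logits, k, rng):
    """Sample an index from only the k highest-scoring logits."""
    # Indices of the k largest logits; all other tokens get zero probability.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the surviving logits (shifted by the max for stability).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

# Toy "vocabulary" of five tokens; with k=2 only the two most likely
# tokens (indices 0 and 1) can ever be drawn, however many times we sample.
logits = [2.0, 1.0, 0.5, -1.0, -3.0]
picks = {top_k_sample(logits, k=2, rng=random.Random(s)) for s in range(50)}
```

In the transformers API this corresponds to `model.generate(..., do_sample=True, top_k=k)` versus `model.generate(..., num_beams=n)` for beam search.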
Training
The model was fine-tuned on the mMARCO dataset with 66k training steps over four epochs, using 500k training pairs. The input text is truncated to 320 word pieces, and the generated output is limited to 64 word pieces.
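These training-time limits carry over to inference. A hedged sketch of how they would map onto standard transformers tokenizer and generation arguments (the `tokenizer`/`model` objects themselves come from the loading step in the guide below):

```python
# Mirror the training limits at inference time (a sketch, not from the card).
encode_kwargs = dict(
    max_length=320,   # input truncated to 320 word pieces, as in training
    truncation=True,
    return_tensors="pt",
)
generate_kwargs = dict(
    max_length=64,    # generated queries capped at 64 word pieces
)
```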
Guide: Running Locally
- Setup: Install the required libraries, particularly transformers and torch.
- Load Model:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = 'doc2query/msmarco-spanish-mt5-base-v1'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
```

- Generate Queries: Use the create_queries function to produce queries from your text.
- Hardware Suggestion: For better performance, use cloud GPUs such as AWS EC2 instances equipped with NVIDIA GPUs.
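The card references create_queries without defining it. A minimal sketch consistent with the training limits above, following the common doc2query usage pattern (the function body is an assumption, not taken from the card):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def create_queries(model, tokenizer, text, num_queries=5):
    """Sketch of a create_queries helper (body assumed, not from the card)."""
    # Truncate the input to the 320 word-piece limit used in training.
    inputs = tokenizer(text, max_length=320, truncation=True, return_tensors="pt")
    with torch.no_grad():
        # Beam search for high-quality queries; for more diverse output,
        # swap num_beams for do_sample=True with a top_k value.
        outputs = model.generate(
            **inputs,
            max_length=64,
            num_beams=num_queries,
            num_return_sequences=num_queries,
        )
    return [tokenizer.decode(ids, skip_special_tokens=True) for ids in outputs]
```

Usage would follow the loading step above: load the tokenizer and model, then call create_queries(model, tokenizer, "...") on a Spanish paragraph to get a list of candidate queries.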
License
The msmarco-spanish-mt5-base-v1 model is licensed under the Apache 2.0 License.