msmarco-spanish-mt5-base-v1

doc2query

Introduction

The MSMARCO-SPANISH-MT5-BASE-V1 model is a doc2query model based on mT5, designed for document expansion and for generating domain-specific training data. Given a Spanish paragraph, it generates plausible search queries, which can be appended to documents to improve search engine performance or used as training data for embedding models.

Architecture

The model is built on Google's mT5 architecture and fine-tuned on mMARCO, a machine-translated multilingual version of MS MARCO. It uses sequence-to-sequence learning to produce queries from a given text input, and it supports both beam search for higher-quality queries and top-k sampling for more diverse outputs.
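
As a rough illustration, both decoding modes can be selected through arguments to generate(); the sample paragraph and the exact sampling parameters below are illustrative assumptions, not values fixed by the model card.

    import torch
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    model_name = 'doc2query/msmarco-spanish-mt5-base-v1'
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Hypothetical Spanish input paragraph
    input_ids = tokenizer.encode(
        "El flamenco es un arte que nació en Andalucía.", return_tensors='pt'
    )

    with torch.no_grad():
        # Beam search: fewer, higher-quality queries
        beam_outputs = model.generate(
            input_ids=input_ids, max_length=64, num_beams=5,
            num_return_sequences=5, no_repeat_ngram_size=2, early_stopping=True
        )
        # Top-k sampling: more diverse, lower-quality queries
        sampled_outputs = model.generate(
            input_ids=input_ids, max_length=64, do_sample=True,
            top_k=10, top_p=0.95, num_return_sequences=5
        )

    for output in beam_outputs:
        print(tokenizer.decode(output, skip_special_tokens=True))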

Training

The model was fine-tuned on the mMARCO dataset with 66k training steps over four epochs, using 500k training pairs. The input text is truncated to 320 word pieces, and the generated output is limited to 64 word pieces.
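
In practice this means inputs should be truncated the same way at inference time; a minimal sketch, where the sample paragraph is an assumption:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('doc2query/msmarco-spanish-mt5-base-v1')

    # Truncate to the 320 word pieces used during training so inference
    # inputs match the length distribution the model saw.
    input_ids = tokenizer.encode(
        "El Amazonas es el río más caudaloso del mundo.",  # hypothetical paragraph
        max_length=320, truncation=True, return_tensors='pt'
    )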

Guide: Running Locally

  1. Setup: Install the required libraries, particularly transformers and torch.
  2. Load Model:
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    
    model_name = 'doc2query/msmarco-spanish-mt5-base-v1'
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    
  3. Generate Queries: Use the create_queries function to produce queries from your text (see the sketch after this list).
  4. Hardware Suggestion: For better performance, use cloud GPUs like AWS EC2 instances equipped with NVIDIA GPUs.
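
A minimal sketch of such a create_queries helper, tying together the length limits described above (the beam-search settings and the sample paragraph are illustrative assumptions):

    import torch
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    model_name = 'doc2query/msmarco-spanish-mt5-base-v1'
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    model.eval()

    def create_queries(paragraph, num_queries=5):
        # Truncate inputs to the 320 word pieces used during training.
        input_ids = tokenizer.encode(
            paragraph, max_length=320, truncation=True, return_tensors='pt'
        )
        with torch.no_grad():
            # Beam search for higher-quality queries; output capped at the
            # 64 word pieces used during training.
            outputs = model.generate(
                input_ids=input_ids, max_length=64, num_beams=num_queries,
                num_return_sequences=num_queries, no_repeat_ngram_size=2,
                early_stopping=True
            )
        return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

    # Hypothetical example paragraph
    for query in create_queries("Madrid es la capital y la ciudad más poblada de España."):
        print(query)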

License

The MSMARCO-SPANISH-MT5-BASE-V1 model is licensed under the Apache 2.0 License.
