doc2query-t5-base-msmarco

castorini

Introduction

doc2query-t5-base-msmarco is a text-to-text generation model from the Castorini group. Given a document, it generates synthetic queries that the document could answer; adding those queries to the document before indexing makes it easier to retrieve. The model is based on the T5 architecture and fine-tuned on the MS MARCO dataset.
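
To make the idea of improving retrievability concrete, here is a minimal sketch of document expansion: generated queries are appended to the original text so that an index sees the extra terms. The function name `expand_document` and the sample strings are illustrative assumptions, not part of the model or its library.

```python
from typing import Iterable


def expand_document(document: str, queries: Iterable[str]) -> str:
    """Append generated queries to a document (hypothetical helper)."""
    # Drop empty or whitespace-only queries and trim the rest.
    cleaned = [q.strip() for q in queries if q.strip()]
    # The expanded text is what would be handed to the search index.
    return " ".join([document.strip(), *cleaned])


doc = "The Manhattan Project produced the first nuclear weapons."
generated = ["what was the manhattan project", " ", "who built the first nuclear weapons "]
print(expand_document(doc, generated))
```

The expanded string, rather than the original document, is what gets indexed; the original text can still be stored separately for display.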

Architecture

The model uses the T5 (Text-to-Text Transfer Transformer) architecture, which handles a wide range of natural language processing tasks by recasting them as text-to-text problems. Here the input text is a document and the output text is a query. The checkpoint is loaded through the Hugging Face Transformers library, which supports both PyTorch and JAX backends for flexibility in deployment and training.

Training

The doc2query-t5-base-msmarco model was fine-tuned on MS MARCO, a large-scale dataset of search queries paired with relevant passages that is widely used to train retrieval models. Training teaches the model to produce a query from a passage, so that at inference time it can generate plausible queries for documents it has not seen.
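
The source does not describe the preprocessing used, so the following is only a sketch of how query and passage pairs could be arranged for this kind of fine-tuning: the passage becomes the input text and the query becomes the target. The `Seq2SeqExample` class and `build_examples` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Seq2SeqExample:
    source: str  # passage text fed to the encoder
    target: str  # query the decoder learns to produce


def build_examples(pairs: Iterable[Tuple[str, str]]) -> List[Seq2SeqExample]:
    """Turn (query, relevant_passage) pairs into passage -> query examples."""
    examples = []
    for query, passage in pairs:
        query, passage = query.strip(), passage.strip()
        # Skip pairs where either side is empty after trimming.
        if query and passage:
            examples.append(Seq2SeqExample(source=passage, target=query))
    return examples


pairs = [
    ("what is t5", "T5 recasts NLP tasks as text-to-text problems."),
    ("", "A passage without a query is skipped."),
]
for example in build_examples(pairs):
    print(example.source, "->", example.target)
```

Note the direction: unlike a ranking model, which takes a query and scores passages, the generator reads the passage and writes the query.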

Guide: Running Locally

To run the doc2query-t5-base-msmarco model locally, follow these steps:

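  1. Install Dependencies: Ensure you have Python installed, then use pip to install the Transformers library together with PyTorch (needed for the "pt" tensors used below) and SentencePiece (required by T5Tokenizer):

    pip install transformers torch sentencepiece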
    
  2. Load the Model: Use the Transformers library to load the model:

    from transformers import T5ForConditionalGeneration, T5Tokenizer
    
    model_name = "castorini/doc2query-t5-base-msmarco"
    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)
    
  3. Inference: Prepare your documents and generate synthetic queries:

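    input_text = "Your document text here."
    # Truncate long documents to T5's usual 512-token input limit.
    input_ids = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512).input_ids
    # Default decoding returns a single query; max_length caps its length in tokens.
    outputs = model.generate(input_ids, max_length=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))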
    
  4. Hardware Recommendation: The model runs on a CPU for a handful of documents, but generating queries for a large collection is much faster on a GPU. If you do not have one locally, consider cloud GPU services such as AWS, Google Cloud, or Azure.
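
Building on the steps above, the sketch below samples several queries per document instead of one and moves the work to a GPU when one is available. The helper names `generate_queries` and `demo`, and the values `top_k=10`, `max_length=64`, and three queries per document, are illustrative assumptions rather than settings confirmed by the model card; `do_sample`, `top_k`, and `num_return_sequences` are standard arguments of `generate` in Transformers.

```python
from typing import List


def generate_queries(model, tokenizer, document: str, num_queries: int = 3,
                     top_k: int = 10, max_query_tokens: int = 64,
                     device: str = "cpu") -> List[str]:
    """Sample several candidate queries for one document (hypothetical helper)."""
    encoded = tokenizer(document, return_tensors="pt",
                        truncation=True, max_length=512)
    input_ids = encoded.input_ids.to(device)
    # Top-k sampling yields varied queries; greedy decoding would give only one.
    outputs = model.generate(
        input_ids=input_ids,
        max_length=max_query_tokens,
        do_sample=True,
        top_k=top_k,
        num_return_sequences=num_queries,
    )
    return [tokenizer.decode(seq, skip_special_tokens=True) for seq in outputs]


def demo() -> None:
    """End-to-end run; downloads the checkpoint on first use."""
    import torch
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    model_name = "castorini/doc2query-t5-base-msmarco"
    # Use the GPU when present, otherwise fall back to the CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name).to(device)
    model.eval()
    for query in generate_queries(model, tokenizer,
                                  "Your document text here.", device=device):
        print(query)
```

Calling `demo()` prints the sampled queries. Because sampling is random, the output differs between runs unless a seed is fixed.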

License

The doc2query-t5-base-msmarco model is provided by Castorini under its own license. Review the terms on the Hugging Face model page, or visit doc2query.ai for more information, before using the model.
