castorini/doc2query-t5-base-msmarco
Introduction
The doc2query-t5-base-msmarco model is a text-to-text generation model from the CASTORINI group, designed to improve document retrieval: it generates synthetic queries for a document, and appending those queries to the document before indexing makes the document easier to retrieve. The model is based on the T5 architecture and fine-tuned on the MS MARCO dataset.
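To make the retrieval benefit concrete, the sketch below shows the expansion step in the usual doc2query setup: the model's synthetic queries are appended to the original document text, and the expanded text is what the search engine indexes. The expand_document helper and the example queries here are illustrative placeholders; producing queries with the actual model is covered in the guide below.

# Conceptual sketch of doc2query-style document expansion (illustrative only).
def expand_document(document: str, generated_queries: list[str]) -> str:
    # The synthetic queries are appended to the document text; the expanded
    # text, not the original, is what gets indexed by the search engine.
    return document + " " + " ".join(generated_queries)

document = "The Manhattan Project produced the first nuclear weapons during World War II."
queries = ["what was the manhattan project", "who built the first nuclear weapons"]
print(expand_document(document, queries))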
Architecture
The model utilizes the T5 (Text-to-Text Transfer Transformer) architecture, which is capable of performing a wide range of natural language processing tasks by recasting them as text-to-text problems. It is built on top of the Transformers library, supporting both PyTorch and JAX backends for flexibility in deployment and training.
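As a brief illustration of that backend flexibility, the same checkpoint can be loaded through either the PyTorch or the Flax (JAX) model classes in Transformers. This is a minimal sketch, assuming the corresponding backend (torch, or jax plus flax) is installed; from_pt=True converts the PyTorch weights on the fly in case no native Flax weights are published for this checkpoint.

from transformers import (
    T5Tokenizer,
    T5ForConditionalGeneration,       # PyTorch class
    FlaxT5ForConditionalGeneration,   # JAX/Flax class
)

model_name = "castorini/doc2query-t5-base-msmarco"
tokenizer = T5Tokenizer.from_pretrained(model_name)

# PyTorch backend
pt_model = T5ForConditionalGeneration.from_pretrained(model_name)

# JAX/Flax backend; from_pt=True converts the PyTorch weights if needed
flax_model = FlaxT5ForConditionalGeneration.from_pretrained(model_name, from_pt=True)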
Training
The doc2query-t5-base-msmarco model was trained on the MS MARCO dataset, a large-scale collection of real web search queries paired with relevant passages, widely used for search and document-retrieval research. During fine-tuning, the model learns to generate, for a given passage, the kinds of queries that the passage answers; these synthetic queries are then used to expand documents and improve search engine performance.
Guide: Running Locally
To run the doc2query-t5-base-msmarco model locally, follow these steps:
- Install Dependencies: Ensure you have Python installed, then install the Transformers library along with SentencePiece (required by the T5 tokenizer) and PyTorch:
pip install transformers sentencepiece torch
- Load the Model: Use the Transformers library to load the tokenizer and model:
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "castorini/doc2query-t5-base-msmarco"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)
- Inference: Prepare your document text and generate a synthetic query (a sketch for sampling several queries per document follows these steps):
input_text = "Your document text here."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

# Greedy decoding produces a single query for the document.
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
- Hardware Recommendation: For optimal performance, especially when expanding large document collections, consider using a GPU, for example through cloud services such as AWS, Google Cloud, or Azure.
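In practice, doc2query expansion usually samples several queries per document rather than taking a single greedy output. The sketch below extends the inference step above with sampling; the specific parameters (max_length=64, top_k=10, three return sequences) are illustrative choices, not values fixed by the model. The resulting queries would then be appended to the document text as in the expansion sketch in the Introduction.

# Sample several candidate queries for the document prepared above.
outputs = model.generate(
    input_ids,
    max_length=64,           # allow longer queries than the default limit
    do_sample=True,          # sample instead of greedy decoding for diversity
    top_k=10,                # restrict sampling to the 10 most likely tokens per step
    num_return_sequences=3,  # produce three queries for this document
)
for i, output in enumerate(outputs):
    print(f"query {i + 1}: {tokenizer.decode(output, skip_special_tokens=True)}")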
License
The doc2query-t5-base-msmarco model is provided under its respective license by CASTORINI. Users should review the terms on the Hugging Face model page or visit doc2query.ai for more information.