query-gen-msmarco-t5-large-v1
Introduction
The query-gen-msmarco-t5-large-v1 model from BeIR is based on the T5 architecture and is used for generating search queries. It was trained on the MS MARCO Passage Dataset, which contains approximately 500k real search queries from Bing paired with relevant passages. By generating synthetic queries for arbitrary passages, the model makes it possible to train semantic search models without annotated data.
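As a rough sketch of that workflow (the passages and pairing logic below are illustrative assumptions, not part of the model card), each generated query can be paired with its source passage to build synthetic training data for a retrieval model:

```python
# Illustrative sketch: build synthetic (query, passage) pairs for training
# a retrieval model. The passages below are placeholder examples.
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_id = 'BeIR/query-gen-msmarco-t5-large-v1'
tokenizer = T5Tokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

passages = [
    "The Amazon rainforest produces a significant share of the world's oxygen.",
    "T5 casts every NLP task as mapping an input text to an output text.",
]

training_pairs = []
for passage in passages:
    input_ids = tokenizer.encode(passage, return_tensors='pt')
    outputs = model.generate(input_ids=input_ids, max_length=64,
                             do_sample=True, top_p=0.95, num_return_sequences=3)
    for output in outputs:
        query = tokenizer.decode(output, skip_special_tokens=True)
        training_pairs.append((query, passage))  # synthetic positive pair

for query, passage in training_pairs:
    print(query, '->', passage[:40])
```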
Architecture
The model utilizes the T5 (Text-to-Text Transfer Transformer) architecture, which casts every task, including query generation, as mapping an input text to an output text. It can be loaded with either PyTorch or JAX/Flax and is intended for text2text-generation and inference tasks.
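Since JAX is listed among the supported libraries, the checkpoint can presumably also be loaded through transformers' Flax classes; the sketch below assumes this (from_pt=True converts the PyTorch weights in case no native Flax weights are published, and requires torch to be installed):

```python
# Sketch: load the checkpoint with transformers' Flax (JAX) classes.
from transformers import T5Tokenizer, FlaxT5ForConditionalGeneration

model_id = 'BeIR/query-gen-msmarco-t5-large-v1'
tokenizer = T5Tokenizer.from_pretrained(model_id)
# from_pt=True converts PyTorch weights when Flax weights are absent.
model = FlaxT5ForConditionalGeneration.from_pretrained(model_id, from_pt=True)

para = "JAX enables just-in-time compilation of numerical Python programs."
input_ids = tokenizer(para, return_tensors='np').input_ids
outputs = model.generate(input_ids, max_length=64, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))
```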
Training
The underlying T5-large model was fine-tuned on the MS MARCO Passage Dataset. This dataset is composed of real-world search queries and their associated passages, enabling the model to learn to generate the kinds of queries a searcher might issue for a given passage.
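For a sense of what the model saw during training, the snippet below peeks at MS MARCO as mirrored on the Hugging Face Hub; the ms_marco dataset id, the v1.1 config, and the field names are assumptions about that mirror, not details from this card:

```python
# Sketch: inspect a few MS MARCO records via the `datasets` library.
# Dataset id, config, and field names are assumptions about the Hub mirror.
from datasets import load_dataset

ds = load_dataset('ms_marco', 'v1.1', split='train')
example = ds[0]
print(example['query'])                        # a real Bing search query
print(example['passages']['passage_text'][0])  # one associated passage
```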
Guide: Running Locally
- Installation and Setup:
  - Install the transformers library from Hugging Face (for example, pip install transformers sentencepiece; the T5 tokenizer depends on sentencepiece).
  - Load the pre-trained tokenizer and model:

    ```python
    from transformers import T5Tokenizer, T5ForConditionalGeneration

    tokenizer = T5Tokenizer.from_pretrained('BeIR/query-gen-msmarco-t5-large-v1')
    model = T5ForConditionalGeneration.from_pretrained('BeIR/query-gen-msmarco-t5-large-v1')
    ```
- Usage Example:
  - Create input data and generate queries:

    ```python
    para = "Python is an interpreted, high-level and general-purpose programming language ..."

    input_ids = tokenizer.encode(para, return_tensors='pt')
    outputs = model.generate(
        input_ids=input_ids,
        max_length=64,
        do_sample=True,
        top_p=0.95,
        num_return_sequences=3)

    for output in outputs:
        query = tokenizer.decode(output, skip_special_tokens=True)
        print(query)
    ```

  - Because do_sample=True with top_p=0.95 samples from the model's output distribution, each run produces varied queries, and num_return_sequences=3 yields three candidate queries per passage.
- Hardware Recommendations:
  - For best performance, use a cloud GPU such as those offered by AWS, Google Cloud, or Azure; a minimal GPU sketch follows after this list.
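As a minimal sketch of GPU usage (assuming a CUDA-capable device is available; the passage text is a placeholder), the model and inputs can be moved to the accelerator before generation:

```python
# Sketch: run query generation on a GPU when one is available.
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_id = 'BeIR/query-gen-msmarco-t5-large-v1'
tokenizer = T5Tokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id).to(device)

para = "Cloud GPUs can substantially speed up inference with large models."
input_ids = tokenizer.encode(para, return_tensors='pt').to(device)
outputs = model.generate(input_ids=input_ids, max_length=64,
                         do_sample=True, top_p=0.95, num_return_sequences=3)
for output in outputs:
    print(tokenizer.decode(output, skip_special_tokens=True))
```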
License
Use of this model is governed by the Apache 2.0 License, which permits broad use and modification with proper attribution.