turkish-colbert
ytu-ce-cosmos

Introduction
TURKISH-COLBERT is a Turkish passage retrieval model based on the ColBERT architecture, developed by the COSMOS Research Group at Yildiz Technical University. It was obtained by fine-tuning a Turkish base BERT model on 500k triplets derived from a Turkish-translated version of the MS MARCO dataset, making it well suited to passage retrieval tasks in Turkish.
Architecture
The model utilizes the ColBERT architecture, which is designed for efficient and accurate passage retrieval via late interaction between query and document token embeddings. The base model is an uncased Turkish BERT; due to a tokenizer issue with the Turkish dotless "ı", input text must be lowercased manually before encoding.
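ColBERT's late-interaction scoring can be sketched as follows. This is a minimal illustration of the MaxSim operation (for each query token embedding, take the maximum similarity over all document token embeddings, then sum), not RAGatouille's actual implementation; the function name and plain-list embeddings are illustrative.

```python
def maxsim_score(query_vecs, doc_vecs):
    # ColBERT late interaction (MaxSim): for each query token embedding,
    # take the maximum dot-product similarity over all document token
    # embeddings, then sum these per-token maxima into one score.
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)
```

Because each query token is matched against its best document token independently, ColBERT captures fine-grained term-level matches while still allowing document embeddings to be precomputed and indexed.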
Training
The model is trained on 500k triplets, each consisting of a query, a positive (relevant) passage, and a negative (irrelevant) passage, drawn from a Turkish-translated version of the MS MARCO dataset. Training was supported by Cloud TPUs from Google's TensorFlow Research Cloud and storage from Hugging Face.
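The triplet setup can be sketched as below. The example strings are hypothetical (not actual training data), and the pairwise softmax cross-entropy loss shown is one common way to train on such triplets, used here for illustration rather than as the model's confirmed objective.

```python
import math

# Hypothetical triplet in the style of Turkish-translated MS MARCO
# (illustrative strings, not actual training data).
triplet = {
    "query": "istanbul'un nüfusu kaç?",
    "positive": "istanbul, türkiye'nin en kalabalık şehridir.",
    "negative": "ankara türkiye'nin başkentidir.",
}

def pairwise_softmax_loss(score_pos, score_neg):
    # Cross-entropy over the (positive, negative) pair: the loss is
    # small when the positive passage scores well above the negative.
    return -math.log(math.exp(score_pos) / (math.exp(score_pos) + math.exp(score_neg)))
```

Minimizing this loss pushes the retrieval score of the positive passage above that of the negative one for each query.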
Guide: Running Locally
- Installation: Install the necessary package using pip:
  pip install ragatouille
- Model Loading: Load the model with:
  from ragatouille import RAGPretrainedModel
  rag = RAGPretrainedModel.from_pretrained("ytu-ce-cosmos/turkish-colbert")
- Data Preparation: Ensure all text is converted to lowercase using:
  docs = [doc.replace("I", "ı").lower() for doc in docs]
- Indexing and Querying:
  rag.index(docs, index_name="sampleTest")
  query = query.replace("I", "ı").lower()
  results = rag.search(query, k=1)
  print(results[0]['content'])
- Suggested Cloud GPUs: For optimal performance, consider using cloud-based GPUs such as those from AWS, Google Cloud, or Microsoft Azure.
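The manual lowercase conversion in the data-preparation step is needed because Python's built-in str.lower() maps the ASCII "I" to "i", whereas Turkish orthography lowercases "I" to the dotless "ı". A small helper (hypothetical name, mirroring the one-liner in the guide) makes the intent explicit:

```python
def turkish_lower(text: str) -> str:
    # str.lower() alone would turn "I" into "i"; replace "I" with the
    # Turkish dotless "ı" first, then lowercase the remaining characters.
    # This covers only the dotless-I case, as in the guide's snippet.
    return text.replace("I", "ı").lower()
```

Apply the same conversion to both documents and queries so that indexing and search see identically normalized text.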
License
The TURKISH-COLBERT model is released under the MIT License, allowing broad usage and modification rights.