convbert base turkish mc4 uncased
dbmdz
Introduction
The Turkish ConvBERT model, published as dbmdz/convbert-base-turkish-mc4-uncased, is part of a series of community-driven BERT, DistilBERT, ELECTRA, and ConvBERT models for the Turkish language. The series draws on datasets contributed by the Turkish NLP community, and its BERT model is named BERTurk. This ConvBERT model was trained on the Turkish portion of the multilingual C4 (mC4) corpus.
Architecture
The model uses the ConvBERT architecture and was pretrained on the Turkish part of the mC4 corpus. It was trained with a sequence length of 512 for 1 million steps. Rather than building a new vocabulary from the mC4 corpus, it reuses the original 32k vocabulary, which keeps it compatible with existing pre-trained models.
Training
Training used the Turkish part of the mC4 corpus, a large dataset released by the AI2 team. After documents with broken encoding were filtered out, the corpus came to 242GB, or over 31 billion tokens. Training ran on a v3-32 TPU, which made it practical to process a dataset of this size.
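As a rough sanity check on the figures above, the short sketch below divides the corpus size by the token count. It assumes decimal gigabytes (10^9 bytes) and treats "over 31 billion" as a lower bound, so the result is an approximate upper bound rather than a reported statistic.

```python
# Figures quoted in the text; the unit interpretation is an assumption.
CORPUS_BYTES = 242 * 10**9   # 242GB, assuming decimal gigabytes
CORPUS_TOKENS = 31 * 10**9   # "over 31 billion" tokens, used as a lower bound


def bytes_per_token(total_bytes: int, total_tokens: int) -> float:
    """Average number of bytes of raw text behind each token."""
    return total_bytes / total_tokens


ratio = bytes_per_token(CORPUS_BYTES, CORPUS_TOKENS)
print(f"~{ratio:.1f} bytes of text per token")  # prints: ~7.8 bytes of text per token
```

Because the true token count is somewhat above 31 billion, the actual average is slightly lower than this estimate.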
Guide: Running Locally
To run the Turkish ConvBERT model locally, follow these steps:
- Install the Transformers library: Make sure the Hugging Face Transformers library is installed. AutoModel additionally requires PyTorch.
    pip install transformers
- Load the model and tokenizer:
    from transformers import AutoTokenizer, AutoModel
    tokenizer = AutoTokenizer.from_pretrained("dbmdz/convbert-base-turkish-mc4-uncased")
    model = AutoModel.from_pretrained("dbmdz/convbert-base-turkish-mc4-uncased")
- Inference: Use the tokenizer and model for your desired task. AutoModel loads the bare encoder, so tasks such as text classification or fill-mask need the corresponding task-specific model class and, typically, fine-tuning.
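The inference step can be sketched as follows. This is a minimal example, not the model authors' recommended pipeline: it assumes PyTorch is installed, and the helper names `mean_pool` and `embed` are hypothetical. Mean pooling is one common way to turn per-token outputs into a single sentence vector; the model card does not prescribe it.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average the token vectors of each sequence, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


def embed(texts, model_name="dbmdz/convbert-base-turkish-mc4-uncased"):
    """Return one mean-pooled vector per input text (hypothetical helper)."""
    # Imported here so mean_pool can be used without transformers installed.
    # The first call downloads the checkpoint and therefore needs network access.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    # Truncate to the 512-token sequence length the model was trained with.
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**batch)
    return mean_pool(outputs.last_hidden_state, batch["attention_mask"])
```

Calling `embed(["Merhaba dünya!", "Bugün hava çok güzel."])` should return a tensor with one row per input text. These vectors come from the pretrained encoder only, so they are a starting point for fine-tuning or similarity experiments rather than task-ready predictions.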
For better performance, especially with large models or datasets, consider running on cloud accelerators, such as GPUs from a cloud provider or TPUs through Google's TensorFlow Research Cloud (TFRC).
License
The Turkish ConvBERT model is released under the MIT license, allowing for flexibility in use and distribution.