dbmdz/bert-base-turkish-cased
Introduction
BERTurk is a cased BERT model for the Turkish language, developed by the MDZ Digital Library team at the Bavarian State Library. It is a community-driven project utilizing datasets from the Turkish NLP community.
Architecture
The BERTurk model architecture is based on the BERT model, adapted for the Turkish language. The model is trained using a filtered and sentence-segmented version of the Turkish OSCAR corpus, a recent Wikipedia dump, various OPUS corpora, and a special corpus provided by Kemal Oflazer.
Training
The training corpus comprises 35 GB of text and 4,404,976,662 tokens. Training was conducted on Google's TensorFlow Research Cloud, using a TPU v3-8 for 2 million steps. PyTorch-Transformers compatible weights are available.
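As a quick check that the released weights load in the Transformers library, a fill-mask pipeline can be used. The sketch below is illustrative only: the Turkish example sentence and the printed fields are assumptions, not part of the original model card.

```python
from transformers import pipeline

# Load the cased BERTurk checkpoint into a fill-mask pipeline
# (downloads the weights on first use).
fill_mask = pipeline("fill-mask", model="dbmdz/bert-base-turkish-cased")

# Illustrative Turkish sentence; BERT-style models use the [MASK] placeholder.
predictions = fill_mask("Türkiye'nin başkenti [MASK] şehridir.")

# Each prediction carries the candidate token and its score.
for p in predictions:
    print(p["token_str"], round(p["score"], 4))
```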
Guide: Running Locally
- Install the Transformers Library
  Ensure you have the transformers library installed:

  ```bash
  pip install transformers
  ```

- Load the Model and Tokenizer
  Use the following snippet to load the BERTurk model and tokenizer (a short usage sketch follows this list):

  ```python
  from transformers import AutoModel, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
  model = AutoModel.from_pretrained("dbmdz/bert-base-turkish-cased")
  ```
- Using Cloud GPUs
  For faster fine-tuning or large-scale inference, consider cloud GPU services such as Google Colab, AWS, or Azure.
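Going one step beyond loading, the following sketch shows one way to tokenize a Turkish sentence and obtain contextual embeddings from the loaded model. The example sentence and variable names are illustrative assumptions, not part of the original guide.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
model = AutoModel.from_pretrained("dbmdz/bert-base-turkish-cased")
model.eval()

# Illustrative Turkish sentence; any text works here.
inputs = tokenizer("Merhaba dünya, bu bir deneme cümlesidir.", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch_size, sequence_length, hidden_size);
# each row is a contextual embedding for one input token.
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)
```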
License
The BERTurk model is distributed under the MIT License.