bert-base-turkish-uncased
Introduction
BERTurk is an uncased BERT model for Turkish developed by the MDZ Digital Library team at the Bavarian State Library. It is a community-driven project built with contributions from the Turkish NLP community, and it is designed to support a wide range of natural language processing tasks in Turkish.
Architecture
BERTurk is based on the BERT architecture, which is a transformer model designed for language representation. This specific model is uncased, meaning it does not distinguish between uppercase and lowercase letters, which is suitable for languages like Turkish where casing is less significant.
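As a quick illustration of the uncased behavior, here is a minimal sketch (assuming the transformers library is installed and the checkpoint can be downloaded from the Hugging Face Hub) that checks whether differently cased versions of the same Turkish sentence tokenize to the same IDs:

from transformers import AutoTokenizer

# Load the uncased BERTurk tokenizer; the vocabulary is fetched from the Hugging Face Hub on first use.
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-uncased")

# Differently cased inputs should map to identical token IDs,
# since the uncased tokenizer lowercases text before WordPiece tokenization.
ids_lower = tokenizer("merhaba dünya")["input_ids"]
ids_mixed = tokenizer("Merhaba Dünya")["input_ids"]
print(ids_lower == ids_mixed)  # expected output: True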
Training
The BERTurk model is trained on a combination of datasets including a filtered version of the Turkish OSCAR corpus, a recent Wikipedia dump, several OPUS corpora, and a corpus provided by Kemal Oflazer. The training corpus is approximately 35GB in size and contains roughly 4.4 billion tokens. Training was performed on a TPU v3-8 for 2 million steps, with support from Google's TensorFlow Research Cloud.
Guide: Running Locally
To use the BERTurk model locally, ensure you have Transformers version 2.3 or higher installed. Here is a basic guide to loading the model:
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-uncased")
model = AutoModel.from_pretrained("dbmdz/bert-base-turkish-uncased")
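Once loaded, the model can be used to produce contextual embeddings. The sketch below is illustrative only and assumes PyTorch plus a reasonably recent Transformers release (the attribute-style outputs shown here were introduced after version 2.3):

import torch

# Tokenize an example Turkish sentence and run it through the model.
inputs = tokenizer("Merhaba dünya, bu bir deneme cümlesidir.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional vector per input token for the base model.
print(outputs.last_hidden_state.shape)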
For optimal performance, especially with large datasets, consider using cloud GPUs such as those available from AWS, Google Cloud, or Azure.
License
The BERTurk model is released under the MIT License, allowing for broad use and distribution.