bert-base-turkish-128k-uncased

dbmdz

Introduction

BERTurk is an uncased BERT model for the Turkish language, developed by the MDZ Digital Library team (dbmdz) at the Bavarian State Library. It is a community-driven project that builds on contributions from the Turkish NLP community.

Architecture

BERTurk uses an uncased BERT architecture with a vocabulary size of 128k. The model is trained on diverse Turkish corpora, including the Turkish portion of the OSCAR corpus, a recent Wikipedia dump, various OPUS corpora, and a special corpus provided by Kemal Oflazer. The resulting training corpus is 35GB in size and contains approximately 44 billion tokens.
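As a quick sanity check on the 128k vocabulary and the uncased behaviour described above, the short sketch below loads only the configuration and tokenizer; the printed values are illustrative and depend on your local transformers version, not guaranteed output.

    from transformers import AutoConfig, AutoTokenizer

    model_name = "dbmdz/bert-base-turkish-128k-uncased"
    config = AutoConfig.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # "128k" refers to the WordPiece vocabulary size stored in the config.
    print(config.vocab_size)

    # Uncased model: input text is lowercased before WordPiece tokenization.
    print(tokenizer.tokenize("Merhaba Dünya"))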

Training

The model was trained for 2 million steps on a TPU v3-8 provided by Google's TensorFlow Research Cloud. PyTorch-Transformers compatible weights are provided for download; TensorFlow checkpoints are not currently published, but can be requested from the maintainers.

Guide: Running Locally

To use the BERTurk model locally, follow these steps:

  1. Install the Transformers library, along with PyTorch, which the examples below use as the backend:

    pip install transformers torch
    
  2. Load the model and tokenizer in Python:

    from transformers import AutoModel, AutoTokenizer

    # Download (or load from the local cache) the uncased Turkish tokenizer and weights.
    tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-128k-uncased")
    model = AutoModel.from_pretrained("dbmdz/bert-base-turkish-128k-uncased")
    
  3. The base model runs on CPU, but for training or large-scale inference consider a local GPU or cloud GPUs from providers such as AWS, Google Cloud, or Azure. An end-to-end inference sketch follows this list.
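Putting these steps together, the following sketch (assuming transformers and PyTorch are installed) encodes a short Turkish sentence and extracts contextual embeddings; mean pooling over tokens is just one illustrative way to turn the hidden states into a sentence vector.

    import torch
    from transformers import AutoModel, AutoTokenizer

    model_name = "dbmdz/bert-base-turkish-128k-uncased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    # Tokenize a Turkish sentence and run a forward pass without gradients.
    inputs = tokenizer("Bugün hava çok güzel.", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Token-level embeddings: (batch_size, sequence_length, hidden_size)
    print(outputs.last_hidden_state.shape)

    # Illustrative sentence embedding: mean over the token dimension.
    sentence_embedding = outputs.last_hidden_state.mean(dim=1)
    print(sentence_embedding.shape)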

License

The BERTurk model is open-sourced under the MIT license, which permits use, modification, and distribution.
