bert-base-turkish-cased

dbmdz

Introduction

BERTurk is a cased BERT model for the Turkish language, developed by the MDZ Digital Library team at the Bavarian State Library. It is a community-driven project utilizing datasets from the Turkish NLP community.

Architecture

The BERTurk architecture follows the standard BERT base model and is adapted to Turkish through its training data and vocabulary. The model is trained on a filtered and sentence-segmented version of the Turkish OSCAR corpus, a recent Wikipedia dump, various OPUS corpora, and a special corpus provided by Kemal Oflazer.
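
As a quick sanity check on the architecture, the sketch below loads only the published configuration (no model weights) and prints the BERT-base dimensions; the printed values come from the hosted config file rather than from this document:

    from transformers import AutoConfig

    # Load the published configuration for BERTurk; this downloads no model weights.
    config = AutoConfig.from_pretrained("dbmdz/bert-base-turkish-cased")

    print(config.model_type)           # "bert"
    print(config.num_hidden_layers)    # transformer layers
    print(config.hidden_size)          # hidden state dimension
    print(config.num_attention_heads)  # attention heads per layer
    print(config.vocab_size)           # WordPiece vocabulary size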

Training

The training corpus comprises 35GB of text, amounting to 44,049,766,662 tokens. Training was conducted on Google's TensorFlow Research Cloud using a TPU v3-8 for 2 million steps. Currently, the model is distributed with PyTorch-Transformers compatible weights.

Guide: Running Locally

  1. Install the Transformers Library
    Ensure you have the transformers library installed:

    pip install transformers
    
  2. Load the Model and Tokenizer
    Use the following code snippet to load the BERTurk model and tokenizer; a minimal end-to-end inference sketch follows after this list:

    from transformers import AutoModel, AutoTokenizer
    
    tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
    model = AutoModel.from_pretrained("dbmdz/bert-base-turkish-cased")
    
  3. Using Cloud GPUs
    For faster inference or fine-tuning, consider using cloud-based GPU services such as Google Colab, AWS, or Azure.
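
Once the model and tokenizer from step 2 are loaded, they can be exercised end to end. The minimal sketch below (the Turkish sentence is an arbitrary placeholder, not from the original documentation) tokenizes an input and extracts contextual embeddings:

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
    model = AutoModel.from_pretrained("dbmdz/bert-base-turkish-cased")
    model.eval()

    # Tokenize an arbitrary Turkish sentence (placeholder example) and run a forward pass.
    inputs = tokenizer("Merhaba dünya!", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Contextual token embeddings with shape (batch_size, sequence_length, hidden_size).
    print(outputs.last_hidden_state.shape)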

License

The BERTurk model is distributed under the MIT License.
