ConvBERT Base Turkish mC4 Cased

dbmdz

Introduction

The ConvBERT-Base-Turkish model, developed by DBMDZ, is a community-driven language model for Turkish. It is based on the ConvBERT architecture and was trained on the Turkish portion of the multilingual C4 (mC4) corpus. This initiative is part of a broader effort, driven by the Turkish NLP community, to provide robust language models for Turkish.

Architecture

The model is based on ConvBERT, a BERT variant that replaces some self-attention heads with span-based dynamic convolutions for improved efficiency. It was trained with a sequence length of 512 tokens for 1 million steps on a v3-32 TPU, and uses the original 32k vocabulary from the mC4 corpus.
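As a quick sanity check of these figures (a sketch, assuming the checkpoint `dbmdz/convbert-base-turkish-mc4-cased` is reachable on the Hugging Face Hub), the published configuration can be inspected directly:

```python
from transformers import AutoConfig, AutoTokenizer

MODEL_ID = "dbmdz/convbert-base-turkish-mc4-cased"

# Download only the configuration and tokenizer files, not the full weights.
config = AutoConfig.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

print(config.model_type)               # expected: "convbert"
print(config.max_position_embeddings)  # maximum sequence length, 512
print(tokenizer.vocab_size)            # the 32k vocabulary mentioned above
```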

Training

The training corpus, derived from the Turkish section of the mC4 corpus, comprises 242GB of data and 31,240,963,926 tokens. Training was conducted on TPUs provided by Google's TensorFlow Research Cloud (TFRC). The model benefits from datasets and corpora contributed by the Turkish NLP community.

Guide: Running Locally

To use the ConvBERT-Base-Turkish model locally, follow these steps:

  1. Install Transformers Library:

    pip install transformers
    
  2. Load the Model:

    from transformers import AutoTokenizer, AutoModel
    
    tokenizer = AutoTokenizer.from_pretrained("dbmdz/convbert-base-turkish-mc4-cased")
    model = AutoModel.from_pretrained("dbmdz/convbert-base-turkish-mc4-cased")
    

For optimal performance, consider running inference on a cloud GPU such as an NVIDIA A100, available from providers like AWS or Google Cloud.

License

The ConvBERT-Base-Turkish model is licensed under the MIT License. This allows for wide usage and distribution, adhering to the open-access principles of the Hugging Face model hub.
