bert-base-german-uncased

dbmdz

Introduction

The BERT-BASE-GERMAN-UNCASED model is a German-language BERT model developed by the MDZ Digital Library team (dbmdz) at the Bavarian State Library. It is an open-source model for German natural language processing tasks, trained on a large, mixed-domain German corpus.

Architecture

The model architecture follows the standard BERT-base setup, using a transformer encoder to process German text. It is available in both cased and uncased versions: the cased variant preserves capitalization, while the uncased variant lowercases input before tokenization.
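
To see the practical difference between the two variants, the snippet below tokenizes the same sentence with each tokenizer. This is a minimal sketch, assuming both dbmdz checkpoints are available from the Hugging Face Hub; the exact subword output may differ.

    from transformers import AutoTokenizer

    sentence = "Die Bayerische Staatsbibliothek"

    # The cased tokenizer keeps capitalization; the uncased one lowercases first
    cased = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-cased")
    uncased = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-uncased")

    print(cased.tokenize(sentence))
    print(uncased.tokenize(sentence))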

Training

The BERT-BASE-GERMAN-UNCASED model was trained on a dataset comprising a recent Wikipedia dump, the EU Bookshop corpus, Open Subtitles, CommonCrawl, ParaCrawl, and News Crawl, totaling 16GB and 2,350,234,427 tokens. Preprocessing involved sentence splitting with spaCy and vocabulary generation with a SentencePiece model, similar to the SciBERT training process. Training was conducted with an initial sequence length of 512 subwords for 1.5 million steps. The model's weights are compatible with PyTorch-Transformers.
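
As a rough sketch of this kind of preprocessing pipeline (not the actual dbmdz scripts; the file names and vocabulary size are placeholders), the following snippet splits raw text into sentences with spaCy and trains a SentencePiece vocabulary:

    import spacy
    import sentencepiece as spm

    # Sentence splitting with spaCy (assumes the German pipeline is installed:
    # python -m spacy download de_core_news_sm)
    nlp = spacy.load("de_core_news_sm")

    with open("corpus_raw.txt", encoding="utf-8") as src, \
            open("corpus_sentences.txt", "w", encoding="utf-8") as dst:
        for line in src:
            doc = nlp(line.strip())
            for sent in doc.sents:
                dst.write(sent.text + "\n")

    # Vocabulary generation with SentencePiece; vocab_size is illustrative,
    # not the value used for the released model
    spm.SentencePieceTrainer.train(
        input="corpus_sentences.txt",
        model_prefix="german_bert",
        vocab_size=32000,
    )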

Guide: Running Locally

To use the German BERT models locally, you can load them with the Transformers library version 2.3 or higher. Here are the basic steps:

  1. Install Transformers Library:
    pip install transformers
    
  2. Load the Model:
    from transformers import AutoModel, AutoTokenizer
    
    tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-uncased")
    model = AutoModel.from_pretrained("dbmdz/bert-base-german-uncased")
    
  3. Run Inference: Use the tokenizer and model to tokenize German text and compute contextual embeddings, or predict masked tokens as in the sketch below.
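
As a concrete inference sketch, the uncased model can be used for masked-token prediction via the fill-mask pipeline (this assumes a reasonably recent Transformers version; the example sentence is purely illustrative):

    from transformers import pipeline

    # Build a fill-mask pipeline on top of the uncased German BERT model
    fill_mask = pipeline("fill-mask", model="dbmdz/bert-base-german-uncased")

    # BERT uses the [MASK] token to mark the position to predict
    for prediction in fill_mask("Die Hauptstadt von Bayern ist [MASK]."):
        print(prediction["token_str"], prediction["score"])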

For enhanced performance, consider using cloud-based GPUs from providers like AWS, Google Cloud, or Azure, which can accelerate the model's inference and training processes.

License

The BERT-BASE-GERMAN-UNCASED model is released under the MIT License, allowing for flexibility in usage and modification.
