dbmdz/bert-base-german-uncased
Introduction
The BERT-BASE-GERMAN-UNCASED model is a German-language BERT model developed by the MDZ Digital Library team (dbmdz) at the Bavarian State Library. It provides an open-source resource for German natural language processing tasks and was trained on a large, diverse German corpus, described in the Training section below.
Architecture
The model architecture follows the standard BERT-base setup, using a transformer encoder to process German text. It is available in both cased and uncased versions: the uncased variant lowercases input text before tokenization, while the cased variant preserves capitalization.
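As an illustration of that difference, here is a minimal sketch (assuming the Transformers library is installed and both dbmdz model repositories are reachable) that compares how the cased and uncased tokenizers handle capitalized German text; the example phrase is only illustrative:

from transformers import AutoTokenizer

# The uncased tokenizer lowercases text before WordPiece splitting,
# while the cased tokenizer preserves the original capitalization.
cased_tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-cased")
uncased_tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-uncased")

print(cased_tokenizer.tokenize("Die Bayerische Staatsbibliothek"))
print(uncased_tokenizer.tokenize("Die Bayerische Staatsbibliothek"))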
Training
The BERT-BASE-GERMAN-UNCASED model was trained on a dataset comprising a recent Wikipedia dump, the EU Bookshop corpus, Open Subtitles, CommonCrawl, ParaCrawl, and News Crawl, totaling 16GB and 2,350,234,427 tokens. Preprocessing involved sentence splitting with spaCy and vocabulary generation with a SentencePiece model, similar to the SciBERT training process. Training was conducted with an initial sequence length of 512 subwords over 1.5 million steps. The model's weights are compatible with PyTorch-Transformers.
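The exact dbmdz preprocessing scripts are not reproduced here; the following is a minimal sketch of the two described steps, assuming the spacy library with a German pipeline (de_core_news_sm) and the sentencepiece library, and using corpus.txt and a vocabulary size of 32000 purely as placeholder values:

import spacy
import sentencepiece as spm

# Step 1: sentence splitting with spaCy (illustrative German pipeline).
nlp = spacy.load("de_core_news_sm")
with open("corpus.txt", encoding="utf-8") as src, open("sentences.txt", "w", encoding="utf-8") as out:
    for line in src:
        doc = nlp(line.strip())
        for sent in doc.sents:
            out.write(sent.text + "\n")

# Step 2: train a SentencePiece model on the split sentences to build the vocabulary.
spm.SentencePieceTrainer.train(
    input="sentences.txt",
    model_prefix="german_bert_vocab",
    vocab_size=32000,
)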
Guide: Running Locally
To use the German BERT models locally, you can load them with the Transformers library version 2.3 or higher. Here are the basic steps:
- Install Transformers Library:
pip install transformers
- Load the Model:
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-uncased")
model = AutoModel.from_pretrained("dbmdz/bert-base-german-uncased")
- Run Inference: Use the tokenizer and model to encode German text and obtain contextual embeddings, as sketched below.
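Here is a minimal sketch of a forward pass that extracts contextual embeddings, assuming a recent Transformers version (4.x) and PyTorch; the example sentence is only illustrative:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-uncased")
model = AutoModel.from_pretrained("dbmdz/bert-base-german-uncased")

# Encode a German sentence and run it through the model without computing gradients.
inputs = tokenizer("Die Bayerische Staatsbibliothek befindet sich in München.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The first output is the sequence of hidden states with shape
# (batch_size, sequence_length, hidden_size), i.e. one 768-dimensional vector per subword token.
print(outputs[0].shape)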
For enhanced performance, consider using cloud-based GPUs from providers like AWS, Google Cloud, or Azure, which can accelerate the model's inference and training processes.
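As a minimal sketch, assuming PyTorch with CUDA support, the model and the encoded inputs from the previous example can be moved to a GPU like this:

import torch

# Select a GPU if one is available; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
inputs = {name: tensor.to(device) for name, tensor in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)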
License
The BERT-BASE-GERMAN-UNCASED model is released under the MIT License, allowing for flexibility in usage and modification.