google-bert/bert-base-german-cased
Introduction
BERT-Base German Cased is a cased language model for German. It was pretrained on several German corpora, including Wikipedia, OpenLegalData, and news articles, and is intended as a base model for downstream tasks such as named entity recognition (NER) and document classification.
Architecture
The model follows the BERT-base cased architecture, adapted for German. Training ran on a single cloud TPU v2.
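As a quick sanity check, the minimal sketch below (assuming the model is published on the Hugging Face Hub under the `bert-base-german-cased` identifier) loads the published configuration to confirm the standard BERT-base dimensions of 12 layers, a hidden size of 768, and 12 attention heads:

```python
from transformers import AutoConfig

# Load the published configuration; the model identifier is assumed to be
# "bert-base-german-cased" on the Hugging Face Hub.
config = AutoConfig.from_pretrained("bert-base-german-cased")

# BERT-base dimensions: 12 layers, hidden size 768, 12 attention heads.
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
```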
Training
The model was trained on approximately 12 GB of data for 810,000 steps with a batch size of 1024 and a sequence length of 128, followed by 30,000 steps with a sequence length of 512. Training used the TensorFlow framework and took about nine days. The data was cleaned and sentence-segmented with tailored scripts, and the vocabulary was created with the sentencepiece library. Performance was evaluated on several German downstream datasets, showing stable learning and competitive results across tasks.
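To see how the cased vocabulary and the 512-token maximum sequence length surface at inference time, here is a minimal sketch; it assumes the tokenizer is available under the `bert-base-german-cased` identifier, and the example sentence is arbitrary:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")

# Cased tokenization: capitalization is preserved, and long German compounds
# are typically split into subword pieces from the German vocabulary.
print(tokenizer.tokenize("Das Bundesverfassungsgericht tagt in Karlsruhe."))

# Inputs longer than the model's 512-token maximum should be truncated.
encoded = tokenizer(
    "Das Bundesverfassungsgericht tagt in Karlsruhe.",
    truncation=True,
    max_length=512,
)
print(len(encoded["input_ids"]))
```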
Guide: Running Locally
- Prerequisites: Ensure you have Python installed along with libraries such as TensorFlow or PyTorch.
- Clone the Repository: Download the model files from the Hugging Face repository.
- Setup Environment: Install necessary dependencies using pip.
- Load the Model: Use the Hugging Face Transformers library to load and run the model.
- Inference: Input text data to perform tasks like fill-mask, NER, or classification (see the example after this list).
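The following sketch shows the load-and-inference steps end to end, assuming the `bert-base-german-cased` model identifier and a PyTorch installation; the example sentence is illustrative only:

```python
from transformers import pipeline

# Load the pretrained masked-language-modeling head via the fill-mask pipeline.
fill_mask = pipeline("fill-mask", model="bert-base-german-cased")

# [MASK] is the mask token for BERT-style models.
for prediction in fill_mask("Die Hauptstadt von Deutschland ist [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```

Note that tasks such as NER or document classification require fine-tuning a task-specific head (for example via `AutoModelForTokenClassification` or `AutoModelForSequenceClassification`) on labeled data; the pretrained checkpoint alone provides only the masked-language-modeling head.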
For optimal performance, consider using cloud GPU services such as AWS, Google Cloud, or Azure.
License
The BERT-Base German Cased model is licensed under the MIT License, which allows for broad use and distribution.