dbmdz/bert-base-german-europeana-cased
Introduction
The German Europeana BERT model is an open-source language model developed by the MDZ Digital Library team (dbmdz) at the Bavarian State Library. It is tailored to German-language text, specifically from historic sources, and is trained on data from the Europeana newspapers collection.
Architecture
The model is based on the BERT architecture and was trained on a 51GB corpus comprising over 8 billion tokens. It is designed to work with PyTorch-Transformers, and only PyTorch-compatible weights are currently available.
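Since the card states only the corpus size, one way to confirm the architecture details is to load the published configuration. A minimal sketch, assuming a recent Transformers version; the values in the comments are the standard BERT-base hyperparameters, not figures taken from this card:

```python
from transformers import AutoConfig

# Fetch the hosted configuration from the Hugging Face Hub.
config = AutoConfig.from_pretrained("dbmdz/bert-base-german-europeana-cased")

print(config.model_type)         # "bert"
print(config.num_hidden_layers)  # 12 for a standard BERT-base model
print(config.hidden_size)        # 768 for a standard BERT-base model
print(config.vocab_size)         # size of the cased German vocabulary
```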
Training
The training data for this model comes from Europeana newspapers, provided by The European Library. Detailed information about the data preparation and pretraining process is documented in the accompanying GitHub repository.
Guide: Running Locally
To run the German Europeana BERT model locally, install the Hugging Face Transformers library, version 2.3 or higher. The basic steps are:
- Installation:

  ```bash
  pip install transformers
  ```

- Load the model (a fuller end-to-end example follows this list):

  ```python
  from transformers import AutoModel, AutoTokenizer

  # Both the tokenizer and the weights are fetched from the Hugging Face Hub.
  tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-europeana-cased")
  model = AutoModel.from_pretrained("dbmdz/bert-base-german-europeana-cased")
  ```

- Cloud GPU: For more efficient training and inference, consider using a cloud GPU service such as AWS EC2, Google Cloud Compute Engine, or Azure.
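Putting the pieces together, the sketch below encodes a sentence and extracts contextual token embeddings. It assumes a recent Transformers version (where the model returns an output object with `last_hidden_state`); the German sentence is purely illustrative:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-europeana-cased")
model = AutoModel.from_pretrained("dbmdz/bert-base-german-europeana-cased")
model.eval()  # disable dropout for deterministic inference

# Illustrative input; any German text works here.
text = "Die Zeitung berichtete über die Ereignisse des Tages."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextual embedding per input token:
# shape is (batch_size, sequence_length, hidden_size).
print(outputs.last_hidden_state.shape)
```

The resulting hidden states can then feed downstream components such as a sentence classifier or a token-level tagger.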
License
The German Europeana BERT model is released under the MIT license, allowing for wide-ranging use and modification.