distilbert-base-german-europeana-cased
Introduction
The DistilBERT-base German Europeana cased model was developed by the MDZ Digital Library team (dbmdz) at the Bavarian State Library. It applies the DistilBERT architecture to German text, with a particular focus on historic sources.
Architecture
DistilBERT is a lighter and faster variant of BERT that retains most of its accuracy. This model is trained on the German Europeana newspapers corpus provided by The European Library, which comprises 51GB of historic German text and over 8 billion tokens.
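Because the checkpoint ships with its configuration, the reduced size is easy to verify. A minimal sketch, assuming the Transformers library is installed, that inspects the architecture (DistilBERT-base uses 6 transformer layers, compared to BERT-base's 12):
from transformers import AutoConfig
# Download the configuration and inspect the reduced architecture
config = AutoConfig.from_pretrained("dbmdz/distilbert-base-german-europeana-cased")
print(config.n_layers)  # 6 transformer layers (BERT-base has 12)
print(config.dim)       # 768-dimensional hidden states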
Training
Details on the training data and the pretraining steps can be found in the associated GitHub repository. The model is intended as a base checkpoint for downstream tasks such as Historic Named Entity Recognition (NER).
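Since the released checkpoint is a plain encoder, using it for NER means adding a token-classification head and fine-tuning on annotated data. A minimal sketch of that starting point; the label scheme below is purely illustrative and not part of the released model:
from transformers import AutoModelForTokenClassification
# Hypothetical label scheme; the real labels come from your annotated corpus
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]
# Loads the pretrained encoder and adds a randomly initialized
# classification head, ready for fine-tuning on a NER dataset
ner_model = AutoModelForTokenClassification.from_pretrained(
    "dbmdz/distilbert-base-german-europeana-cased",
    num_labels=len(labels),
)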
Guide: Running Locally
To use the German Europeana DistilBERT model, ensure you have Transformers version 4.3 or later. You can load the model using the following code:
from transformers import AutoModel, AutoTokenizer
model_name = "dbmdz/distilbert-base-german-europeana-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
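Once loaded, the model can be used to produce contextual embeddings. A short usage sketch, assuming PyTorch is installed; the input sentence is only an example:
# Encode an example sentence and run it through the model
text = "Berlin, den 1. Januar 1900."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)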
For faster inference and fine-tuning, consider using a cloud GPU service such as AWS, Google Cloud Platform, or Azure.
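Whichever environment you use, moving the model onto a GPU when one is available takes only a few lines. A minimal sketch, building on the snippet above:
import torch
# Use a GPU if available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
outputs = model(**inputs)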
License
This model is released under the MIT License, allowing for broad usage and modification.