distilbert-base-german-europeana-cased
Introduction
The DistilBERT-base German Europeana cased model was developed by the MDZ Digital Library team (dbmdz) at the Bavarian State Library. It applies the DistilBERT architecture to German text, with a particular focus on historic sources.
Architecture
DistilBERT is a lighter and faster variant of BERT that retains most of its accuracy. This model is trained on the German Europeana newspapers corpus provided by The European Library, which comprises 51GB of historic German text and over 8 billion tokens.
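Because the checkpoint ships with its configuration, the reduced size is easy to verify. A minimal sketch, assuming the Transformers library is installed, that inspects the architecture (DistilBERT-base uses 6 transformer layers, compared to BERT-base's 12):
from transformers import AutoConfig
# Download the configuration and inspect the reduced architecture
config = AutoConfig.from_pretrained("dbmdz/distilbert-base-german-europeana-cased")
print(config.n_layers)  # 6 transformer layers (BERT-base has 12)
print(config.dim)       # 768-dimensional hidden states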
Training
Details on the training data and the pretraining steps can be found in the associated GitHub repository. The model is intended as a base checkpoint for downstream tasks such as Historic Named Entity Recognition (NER).
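Since the released checkpoint is a plain encoder, using it for NER means adding a token-classification head and fine-tuning on annotated data. A minimal sketch of that starting point; the label scheme below is purely illustrative and not part of the released model:
from transformers import AutoModelForTokenClassification
# Hypothetical label scheme; the real labels come from your annotated corpus
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]
# Loads the pretrained encoder and adds a randomly initialized
# classification head, ready for fine-tuning on a NER dataset
ner_model = AutoModelForTokenClassification.from_pretrained(
    "dbmdz/distilbert-base-german-europeana-cased",
    num_labels=len(labels),
)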
Guide: Running Locally
To use the German Europeana DistilBERT model, ensure you have Transformers version 4.3 or later. You can load the model using the following code:
from transformers import AutoModel, AutoTokenizer
model_name = "dbmdz/distilbert-base-german-europeana-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
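Once loaded, the model can be used to produce contextual embeddings. A short usage sketch, assuming PyTorch is installed; the input sentence is only an example:
# Encode an example sentence and run it through the model
text = "Berlin, den 1. Januar 1900."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)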
For faster inference and fine-tuning, consider using a cloud GPU service such as AWS, Google Cloud Platform, or Azure.
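Whichever environment you use, moving the model onto a GPU when one is available takes only a few lines. A minimal sketch, building on the snippet above:
import torch
# Use a GPU if available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
outputs = model(**inputs)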
License
This model is released under the MIT License, allowing for broad usage and modification.