bert-base-german-europeana-cased

dbmdz

Introduction

The German Europeana BERT model is an open-source language model developed by the MDZ Digital Library team (dbmdz) at the Bavarian State Library. It is tailored to German-language text from historical sources, trained on data from the Europeana newspapers collection.

Architecture

The model follows the BERT base architecture and was trained on a 51GB corpus comprising more than 8 billion tokens. It is designed to work with PyTorch-Transformers, and weights are currently available in PyTorch-compatible format only.
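
To sanity-check these details locally, the published configuration can be inspected directly. This is a minimal sketch, assuming only that the weights are hosted on the Hugging Face Hub under dbmdz/bert-base-german-europeana-cased:

    from transformers import AutoConfig
    
    # Fetch the model configuration from the Hugging Face Hub
    config = AutoConfig.from_pretrained("dbmdz/bert-base-german-europeana-cased")
    
    # For a BERT base model, this prints 12 layers, 12 attention heads,
    # and a hidden size of 768
    print(config.num_hidden_layers, config.num_attention_heads, config.hidden_size)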

Training

The training data for this model comes from Europeana newspapers, provided by The European Library. Detailed information about the data preparation and pretraining processes is available in the accompanying GitHub repository.

Guide: Running Locally

To run the German Europeana BERT model locally, you need to install the Hugging Face Transformers library version 2.3 or higher. Here are the basic steps:

  1. Installation:

    pip install "transformers>=2.3"
    
  2. Load the Model (a usage sketch follows these steps):

    from transformers import AutoModel, AutoTokenizer
    
    # Download the cased German Europeana tokenizer and model weights
    tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-europeana-cased")
    model = AutoModel.from_pretrained("dbmdz/bert-base-german-europeana-cased")
    
  3. Cloud GPU: For more efficient training and inference, consider using a cloud GPU service such as AWS EC2, Google Cloud Compute Engine, or Azure.
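
Once loaded, the model can serve as a feature extractor for historical German text. The following is a minimal sketch assuming a recent Transformers version (which returns output objects rather than plain tuples); the example sentence is an arbitrary illustration, not taken from the training corpus:

    import torch
    from transformers import AutoModel, AutoTokenizer
    
    tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-europeana-cased")
    model = AutoModel.from_pretrained("dbmdz/bert-base-german-europeana-cased")
    
    # Tokenize a German sentence and return PyTorch tensors
    # ("The newspaper reported on the event yesterday.")
    inputs = tokenizer("Die Zeitung berichtete gestern über das Ereignis.", return_tensors="pt")
    
    # Forward pass without gradient tracking (inference only)
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Contextual token embeddings: (batch_size, sequence_length, hidden_size)
    print(outputs.last_hidden_state.shape)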

License

The German Europeana BERT model is released under the MIT license, allowing for wide-ranging use and modification.
