bert-base-german-europeana-uncased

dbmdz

Introduction

The bert-base-german-europeana-uncased model is developed by the MDZ Digital Library (dbmdz) team at the Bavarian State Library, which open-sources German Europeana BERT models trained on the Europeana newspapers dataset.

Architecture

The model follows the BERT base architecture and is designed specifically for processing historical German text. It is compatible with PyTorch-Transformers and was trained on a large corpus of 8,035,986,369 tokens drawn from a 51GB dataset.
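Like other BERT models, this one segments text with WordPiece tokenization, splitting out-of-vocabulary words (common in historical German spelling) into subword units. A minimal pure-Python sketch of the greedy longest-match-first WordPiece algorithm, using a toy vocabulary rather than the model's real ~30k-entry vocab file:

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece segmentation.

    Continuation pieces are prefixed with '##', as in BERT vocabularies.
    Toy sketch only; the real uncased tokenizer also lowercases input
    and strips accents before segmenting.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Find the longest vocabulary entry matching at position `start`.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # No valid segmentation for this word.
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary (hypothetical; not the model's actual vocab).
vocab = {"zeitung", "##en", "lese", "##r"}
print(wordpiece_tokenize("zeitungen", vocab))  # ['zeitung', '##en']
print(wordpiece_tokenize("leser", vocab))      # ['lese', '##r']
```

The same longest-match loop is what the real tokenizer applies per whitespace-separated word, falling back to the unknown token when no subword cover exists.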

Training

The training corpus comprises data from the open-source Europeana newspapers provided by The European Library. Detailed information on data and pretraining steps can be accessed from the Europeana BERT repository.

Guide: Running Locally

  1. Installation: Ensure Python is installed along with the Transformers library, version 2.3 or newer. Install via pip:

    pip install transformers
    
  2. Loading the Model: Use the following Python code to load the model and tokenizer:

    from transformers import AutoModel, AutoTokenizer
    
    tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-europeana-uncased")
    model = AutoModel.from_pretrained("dbmdz/bert-base-german-europeana-uncased")
    
  3. Computational Resources: For efficient training and inference, consider utilizing cloud GPUs such as those from AWS, Google Cloud, or Azure.
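Once the model and tokenizer are loaded as above, a common next step is to pool the per-token hidden states into a single sentence vector using attention-mask-aware mean pooling. A minimal pure-Python sketch of that pooling step, with toy numbers standing in for real model outputs (in practice, `hidden_states` would come from `model(**inputs).last_hidden_state`):

```python
def mean_pool(hidden_states, attention_mask):
    """Average token vectors, skipping padding positions.

    hidden_states: list of per-token vectors (toy stand-in for the
    model's last hidden state).
    attention_mask: 1 for real tokens, 0 for padding, matching the
    mask a Hugging Face tokenizer produces.
    """
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, mask in zip(hidden_states, attention_mask):
        if mask:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    return [t / count for t in total]

# Toy example: three positions, the last one is padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(hidden, mask))  # [2.0, 3.0]
```

Dividing by the count of unmasked tokens (rather than the sequence length) keeps padded batches from diluting the sentence representation.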

License

The model is released under the MIT license, allowing for broad usage and modification rights.
