literary german bert

severinsimmler

German BERT for Literary Texts

Introduction

The Literary-German-BERT model is derived from the bert-base-german-dbmdz-cased model and fine-tuned specifically for literary texts. It has undergone two phases of fine-tuning: first, on the Corpus of German-Language Fiction for language modeling, and second, for named entity recognition using the DROC corpus to identify protagonists in German novels.

Architecture

This model uses the BERT architecture, a transformer encoder, with a token classification head that assigns one label to each token of a German-language literary text. It can be run with PyTorch or JAX.

Training

Language Modeling

  • Dataset: Corpus of German-Language Fiction, containing 3,194 documents with over 203 million tokens.
  • Time Span: Texts range from the 18th to the 20th century.
  • Performance: After one epoch of training, the fine-tuned model achieved a perplexity of 4.98 compared to the vanilla BERT's 6.82.
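
To put the two perplexity figures in context, the sketch below converts them to mean per-token cross-entropy and computes the relative reduction, which comes to roughly 27%. It assumes perplexity is defined as the exponential of the mean cross-entropy in nats, which is the common convention; the source does not state how the reported values were computed, and the function names are illustrative.

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity as exp of the mean per-token cross-entropy (nats). Assumed convention."""
    return math.exp(mean_cross_entropy)


def cross_entropy(ppl: float) -> float:
    """Inverse of perplexity(): the mean loss that yields a given perplexity."""
    return math.log(ppl)


# Perplexities reported in the list above.
FINE_TUNED_PPL = 4.98
VANILLA_PPL = 6.82

# Relative reduction in perplexity after one epoch of domain fine-tuning.
relative_reduction = 1 - FINE_TUNED_PPL / VANILLA_PPL

print(f"fine-tuned loss: {cross_entropy(FINE_TUNED_PPL):.3f} nats")
print(f"vanilla loss:    {cross_entropy(VANILLA_PPL):.3f} nats")
print(f"relative perplexity reduction: {relative_reduction:.1%}")
```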

Named Entity Recognition

  • Dataset: DROC corpus, split into 10,799 sentences for training, 547 for validation, and 1,845 for testing.
  • Labels: B-PER, I-PER, and O.
  • Performance: Achieved F1 scores of 91.6 on the validation set and 93.8 on the test set.
  • Cross-validation: The model was also evaluated with 10-fold cross-validation and compared against a Conditional Random Field (CRF) baseline.
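
The three labels follow the BIO scheme: B-PER marks the first token of a person mention, I-PER continues it, and O marks everything else. The sketch below shows one way to turn a tag sequence into mention spans and to score a prediction with exact-match, span-level F1. The source does not say whether the reported F1 is computed per token or per span, so the span-level scoring here is an assumption, and the example sentence and tags are made up for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Collect contiguous B-PER/I-PER runs as (start, end) spans."""
    spans: List[Span] = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B-PER":
            if start is not None:      # a new mention begins right after another
                spans.append((start, i))
            start = i
        elif tag == "I-PER":
            if start is None:          # lenient: a stray I-PER opens a span
                start = i
        else:                          # "O" closes any open span
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:              # mention running to the end of the sentence
        spans.append((start, len(tags)))
    return spans


def span_f1(gold: List[str], pred: List[str]) -> float:
    """Exact-match F1 over person spans (assumed metric, see lead-in)."""
    gold_spans, pred_spans = set(bio_to_spans(gold)), set(bio_to_spans(pred))
    true_positives = len(gold_spans & pred_spans)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_spans)
    recall = true_positives / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentence with gold and predicted tags.
tokens = ["Effi", "Briest", "sah", "Innstetten", "an", "."]
gold = ["B-PER", "I-PER", "O", "B-PER", "O", "O"]
pred = ["B-PER", "I-PER", "O", "O", "O", "O"]

for start, end in bio_to_spans(gold):
    print(" ".join(tokens[start:end]))
print(f"span F1: {span_f1(gold, pred):.3f}")
```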

Guide: Running Locally

  1. Prerequisites:

    • Install the Hugging Face Transformers library.
    • Ensure Python and PyTorch or JAX are set up in your environment.
  2. Clone the Repository (the weight files are stored with Git LFS, so enable it first):

    git lfs install
    git clone https://huggingface.co/severinsimmler/literary-german-bert
    
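  3. Install Dependencies: If the cloned repository includes a requirements.txt, install from it; otherwise install the libraries from step 1 directly.

    pip install -r requirements.txt   # only if the file exists
    pip install transformers torch    # otherwise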
    
  4. Load the Model: Use the Transformers library to load and utilize the model for token classification tasks.

  5. Cloud GPUs: For enhanced performance, consider using cloud-based GPU services like AWS EC2, Google Cloud Platform, or Azure.
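
Step 4 can be sketched with the Transformers `pipeline` API as follows. This is a minimal sketch, not the author's reference code: it assumes the Hub repository from the clone URL also provides a tokenizer and config, it uses `aggregation_strategy="simple"` to merge B-PER/I-PER pieces into whole mentions, and the helper names `load_tagger` and `person_mentions` as well as the example sentence are hypothetical.

```python
from typing import Dict, List

# Hub ID taken from the clone URL in step 2.
MODEL_ID = "severinsimmler/literary-german-bert"


def load_tagger(model_id: str = MODEL_ID):
    """Build a token-classification pipeline; downloads the weights on first use."""
    # Imported lazily so person_mentions() can be used without transformers installed.
    from transformers import pipeline

    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",  # merge word pieces into whole entity spans
    )


def person_mentions(entities: List[Dict]) -> List[str]:
    """Return the surface text of every span the pipeline grouped as PER."""
    return [e["word"] for e in entities if e.get("entity_group") == "PER"]


# Typical use (needs network access the first time the model is loaded):
#   tagger = load_tagger()
#   print(person_mentions(tagger("Effi Briest sah Innstetten lange an.")))
```

Passing the path of a local clone instead of the Hub ID to `load_tagger` should work the same way, since `pipeline` accepts either.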

License

The model does not specify a license in the provided information. For usage and redistribution permissions, refer to the model's page on Hugging Face or contact the model creator directly.
