Longformer-GottBERT-Base-8192-AW512
Maintained by LennartKeller.
Introduction
This model is a fine-tuned version of the Longformer architecture, specifically adapted for German text using the OSCAR dataset. It is designed for feature extraction with a focus on processing long sequences of text efficiently.
Architecture
The Longformer-GottBERT-Base-8192-AW512 model is built upon the German version of the RoBERTa architecture, known as GottBERT. It features local attention windows with a fixed size of 512 tokens across all layers and supports a maximum sequence length of 8192 tokens. This configuration allows for the processing of lengthy texts by combining local and global attention mechanisms.
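As a minimal illustration, the attention-window and sequence-length settings can be inspected directly from the model configuration. The repository id below is an assumption inferred from the model name and author and may need to be adjusted.
```python
from transformers import AutoConfig

# Assumed Hugging Face repository id (inferred from the model name and author).
model_id = "LennartKeller/longformer-gottbert-base-8192-aw512"

config = AutoConfig.from_pretrained(model_id)
print(config.attention_window)          # local attention window per layer, expected 512 for all layers
print(config.max_position_embeddings)   # maximum supported sequence length, about 8192 tokens
```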
Training
The model was trained on a 500 million token subset of the German portion of the OSCAR dataset, sourced from the 2017 Common Crawl. The training employed masked language modeling over 3 epochs with the following hyperparameters:
- Learning Rate: 3e-05
- Train Batch Size: 2
- Eval Batch Size: 4
- Seed: 42
- Gradient Accumulation Steps: 8
- Total Train Batch Size: 16
- Optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- LR Scheduler Type: Linear
- Warmup Steps: 500
- Mixed Precision Training: Native AMP
Model performance was validated on a held-out 5% split of the training subset.
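A minimal sketch of how such a masked-language-modeling run could be set up with the Hugging Face Trainer is shown below. The dataset slice, repository id, and masking probability are illustrative assumptions rather than the exact original setup.
```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Assumed repository id; the original run started from a GottBERT checkpoint converted to Longformer.
model_id = "LennartKeller/longformer-gottbert-base-8192-aw512"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Illustrative stand-in for the 500-million-token German OSCAR subset described above.
dataset = load_dataset("oscar", "unshuffled_deduplicated_de", split="train[:1%]")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=8192)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
splits = tokenized.train_test_split(test_size=0.05, seed=42)  # 5% held out for validation

args = TrainingArguments(
    output_dir="longformer-gottbert-mlm",
    learning_rate=3e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=8,   # effective train batch size of 16
    num_train_epochs=3,
    lr_scheduler_type="linear",
    warmup_steps=500,
    seed=42,
    fp16=True,                       # native AMP mixed precision
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```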
Guide: Running Locally
To run the model locally, follow these steps:
- Install Required Libraries: Ensure you have Transformers, PyTorch, Datasets, and Tokenizers installed.
- Download the Model: Clone or download the model from its Hugging Face model card.
- Set Up Environment: Configure your Python environment to match the framework versions:
- Transformers 4.15.0
- PyTorch 1.10.1+cu113
- Datasets 1.17.0
- Tokenizers 0.10.3
- Load the Model: Use the Transformers library to load the model and tokenizer.
- Inference: Run inference on your input data as needed (a combined loading and inference sketch follows this list).
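The sketch below combines the loading and inference steps for feature extraction. The repository id is an assumption inferred from the model name and author, and the mean pooling at the end is just one common way to obtain a document embedding.
```python
# pip install transformers torch datasets tokenizers  (pin the versions listed above if needed)
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed Hugging Face repository id; adjust if the actual repo name differs.
model_id = "LennartKeller/longformer-gottbert-base-8192-aw512"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

text = "Ein langer deutscher Beispieltext, der mehrere tausend Tokens umfassen darf."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)

# Longformer applies local (windowed) attention by default; give at least the first
# token global attention so it can attend to the entire sequence.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

with torch.no_grad():
    outputs = model(**inputs, global_attention_mask=global_attention_mask)

token_embeddings = outputs.last_hidden_state   # shape: (1, seq_len, hidden_size)
doc_embedding = token_embeddings.mean(dim=1)   # simple mean pooling over tokens
print(doc_embedding.shape)
```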
For improved performance, consider using cloud GPUs such as those provided by AWS, Google Cloud, or Azure.
License
Licensing terms for the model and the training data are not stated here. Refer to the Hugging Face model card and the OSCAR dataset's website for specific licensing details and terms of use.