bert tiny historic multilingual cased
dbmdz
Introduction
The BERT-TINY-HISTORIC-MULTILINGUAL-CASED model is part of a collection designed for processing historical text in multiple languages. It supports German, French, English, Finnish, and Swedish, utilizing datasets from sources like Europeana and the British Library. These models are tailored for historic language processing, taking into account the noise present in older texts.
Architecture
The model follows the BERT architecture. The collection offers compact variants (Tiny, Mini, Small, and Medium), of which this model is the Tiny one. These variants are inspired by research on the benefits of pre-training smaller models, which improves efficiency and reduces computational cost.
Training
The multilingual models, including BERT-TINY, were pre-trained on substantial historical text corpora. For example, the German corpus was filtered by OCR confidence to improve data quality; a threshold of 0.6 was ultimately chosen, yielding a 28 GB dataset. Similarly, the French corpus was filtered at a 0.7 confidence threshold, producing a 27 GB dataset. Models were trained on TPUs, with configurations varying by model size.
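The confidence-based filtering described above can be illustrated with a minimal sketch. The `OcrPage` record, its field names, and the sample pages are hypothetical, and the inclusive `>=` comparison is an assumption; the source does not say how the original pipeline stored confidences or handled boundary values.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class OcrPage:
    """Hypothetical record: one OCR'd page with a mean confidence in [0, 1]."""
    text: str
    ocr_confidence: float


def filter_by_confidence(pages: Iterable[OcrPage], threshold: float) -> Iterator[OcrPage]:
    """Yield only the pages whose OCR confidence meets the threshold (assumed inclusive)."""
    for page in pages:
        if page.ocr_confidence >= threshold:
            yield page


# Made-up sample data for illustration only.
pages = [
    OcrPage("Die Zeitung erscheint täglich.", 0.91),
    OcrPage("D1e Ze!tung ersch3int tag1ich", 0.42),
    OcrPage("Le journal paraît chaque jour.", 0.65),
]

# 0.6 and 0.7 are the thresholds reported for the German and French corpora.
kept_06 = list(filter_by_confidence(pages, threshold=0.6))
kept_07 = list(filter_by_confidence(pages, threshold=0.7))
print(len(kept_06), len(kept_07))
```

A higher threshold discards more noisy text but also shrinks the corpus, which is the trade-off behind choosing a different value per language.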
The pre-training workflow involved computing subword vocabularies and assessing subword fertility rates; the resulting models were then fine-tuned on named entity recognition (NER) corpora. The English model was trained separately using the Hugging Face JAX/FLAX implementation. The training process was supported by Google’s TPU Research Cloud and Hugging Face.
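Subword fertility is commonly defined as the average number of subword tokens a tokenizer produces per word. The sketch below computes it under that definition; `toy_tokenize` is a stand-in invented for this example so the code runs without downloading anything, and it is not the model's actual vocabulary.

```python
from typing import Callable, List


def subword_fertility(words: List[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of subword tokens produced per word."""
    if not words:
        raise ValueError("need at least one word")
    total_subwords = sum(len(tokenize(word)) for word in words)
    return total_subwords / len(words)


def toy_tokenize(word: str) -> List[str]:
    """Stand-in tokenizer (an assumption for this sketch): splits every 4 characters."""
    pieces = [word[i:i + 4] for i in range(0, len(word), 4)]
    # Mark continuation pieces the way WordPiece does, purely for readability.
    return [pieces[0]] + ["##" + piece for piece in pieces[1:]]


words = "the historic newspaper archive".split()
fertility = subword_fertility(words, toy_tokenize)
print(fertility)
```

With a real tokenizer, the `tokenize` method of a `transformers` tokenizer can be passed in place of `toy_tokenize`. A fertility of 1.0 means every word maps to a single token; higher values indicate that words are being split into more pieces, which tends to happen with noisy or out-of-vocabulary historical spellings.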
Guide: Running Locally
- Setup Environment:
  - Install Python and necessary libraries such as PyTorch or TensorFlow.
  - Clone the repository from Hugging Face.
- Download Model:
  - Access the model through the Hugging Face Model Hub.
- Run Pre-trained Model:
  - Use the appropriate library (e.g., transformers in Python) to load and run the model.
  - Example code snippet:

      from transformers import AutoModelForMaskedLM, AutoTokenizer

      # Load the tokenizer and the masked-language-model weights from the Hub
      tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-tiny-historic-multilingual-cased")
      model = AutoModelForMaskedLM.from_pretrained("dbmdz/bert-tiny-historic-multilingual-cased")

- Optimize for Performance:
  - Consider using cloud GPUs from providers such as AWS, Google Cloud, or Azure for enhanced performance.
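Since this is a masked language model, one way to try it out is to ask it to fill in a masked token. The following is a minimal sketch using the `transformers` fill-mask pipeline; the helper names `load_fill_mask` and `top_predictions` are invented for this example, and `[MASK]` is assumed to be the mask token, as is standard for BERT tokenizers.

```python
from typing import Any, Callable, Dict, List, Tuple

MODEL_ID = "dbmdz/bert-tiny-historic-multilingual-cased"


def load_fill_mask():
    """Build a fill-mask pipeline; the model is downloaded on first use."""
    # Imported lazily so top_predictions can be used without transformers installed.
    from transformers import pipeline
    return pipeline("fill-mask", model=MODEL_ID)


def top_predictions(
    fill_mask: Callable[..., List[Dict[str, Any]]],
    text: str,
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Return (token, score) pairs for the single [MASK] in `text`."""
    if text.count("[MASK]") != 1:
        raise ValueError("text must contain exactly one [MASK] token")
    # For a single mask, the pipeline returns a list of dicts sorted by score.
    results = fill_mask(text, top_k=top_k)
    return [(r["token_str"], float(r["score"])) for r in results]
```

A call such as `top_predictions(load_fill_mask(), "Die Zeitung erscheint [MASK] .")` would then return the model's highest-scoring candidates for the masked position. Because the Tiny variant is very small, it runs comfortably on a CPU, so a GPU is optional for this kind of quick check.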
License
The BERT-TINY-HISTORIC-MULTILINGUAL-CASED model is licensed under the MIT License, allowing for extensive reuse with minimal restrictions.