bert-base-parsbert-ner-uncased
HooshvareLab
Introduction
ParsBERT is a monolingual language model for Persian built on Google's BERT architecture. This NER variant labels entities such as person names, locations, and organizations in Persian text. It is fine-tuned on the ARMAN and PEYMA datasets, applies BERT's whole word masking technique, and is uncased.
Architecture
ParsBERT mirrors the BERT-Base configuration (12 transformer layers, 768 hidden units, 12 attention heads) and targets Persian language understanding. The checkpoint is available for PyTorch, TensorFlow, and JAX through the Hugging Face Transformers library, with a token classification head for Persian NER.
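As a quick sanity check, the BERT-Base-style configuration can be inspected through Transformers. This is a minimal sketch assuming the publicly hosted HooshvareLab/bert-base-parsbert-ner-uncased checkpoint on the Hugging Face Hub:

```python
from transformers import AutoConfig

# Model ID assumed from the Hugging Face Hub.
config = AutoConfig.from_pretrained("HooshvareLab/bert-base-parsbert-ner-uncased")

print(config.num_hidden_layers)    # expected: 12, as in BERT-Base
print(config.hidden_size)          # expected: 768
print(config.num_attention_heads)  # expected: 12
```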
Training
The model is trained on Persian NER using the ARMAN and PEYMA datasets. Both datasets use IOB (inside-outside-beginning) tagging, framing NER as multi-class token classification. The PEYMA dataset comprises 7,145 sentences and ARMAN 7,682, each annotated with multiple entity classes.
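For illustration, under IOB tagging each token carries a B- (beginning of entity), I- (inside entity), or O (outside) label. The sentence and labels below are an invented example, not drawn from PEYMA or ARMAN:

```python
# Illustrative IOB-tagged sentence (hypothetical, for format only):
# "علی در تهران زندگی می‌کند" -- "Ali lives in Tehran"
tagged = [
    ("علی", "B-PER"),     # beginning of a person entity
    ("در", "O"),          # outside any entity
    ("تهران", "B-LOC"),   # beginning of a location entity
    ("زندگی", "O"),
    ("می‌کند", "O"),
]
```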
Guide: Running Locally
To run ParsBERT locally:
- Install dependencies: ensure Python and the Hugging Face Transformers library (plus PyTorch or TensorFlow) are installed.
- Load the model: use the Transformers library to load the ParsBERT NER checkpoint and its matching tokenizer.
- Prepare data: supply Persian text in the format expected for token classification.
- Inference: run the model to label each token in the input, as in the sketch below.
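A minimal sketch of these steps, assuming the HooshvareLab/bert-base-parsbert-ner-uncased checkpoint on the Hugging Face Hub and the Transformers token-classification pipeline:

```python
# pip install transformers torch
from transformers import pipeline

# The pipeline loads the tokenizer and model together.
# Model ID assumed from the Hugging Face Hub.
ner = pipeline(
    "token-classification",
    model="HooshvareLab/bert-base-parsbert-ner-uncased",
    aggregation_strategy="simple",  # merge sub-word pieces into whole entities
)

# Persian input: "Ali lives in Tehran." (illustrative sentence)
for entity in ner("علی در تهران زندگی می‌کند."):
    print(entity["word"], entity["entity_group"], round(entity["score"], 3))
```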
For optimal performance, consider using cloud GPU services such as Google Colab or AWS EC2 with GPU support.
License
ParsBERT is released under the Apache 2.0 license, permitting free use, modification, and distribution with proper attribution.