bert fa base uncased
HooshvareLabIntroduction
ParsBERT is a monolingual language model designed for Persian language understanding, based on Google's BERT architecture. It is pre-trained on a diverse range of Persian corpora, including scientific texts, novels, and news articles, comprising over 3.9 million documents, 73 million sentences, and 1.3 billion words. ParsBERT is primarily intended for fine-tuning on downstream tasks such as masked language modeling and next sentence prediction.
Architecture
ParsBERT follows the Transformer-based architecture of BERT, tailored specifically for the Persian language. It incorporates a unique vocabulary and has been fine-tuned on extensive Persian datasets to improve its performance across various tasks.
Training
The model was trained using a large collection of Persian texts sourced from public corpora such as Persian Wikidumps and MirasText, as well as manually crawled data from diverse websites covering topics like science, lifestyle, travel, and more. Pre-processing involved POS tagging and WordPiece segmentation to standardize the data. After 300,000 steps, the model achieved a masked language modeling accuracy of 68.66% and a next sentence prediction accuracy of 100%.
Guide: Running Locally
Basic Steps
- Install Transformers Library: Ensure you have the
transformers
library installed. - Load Model and Tokenizer:
- TensorFlow:
from transformers import AutoConfig, AutoTokenizer, TFAutoModel config = AutoConfig.from_pretrained("HooshvareLab/bert-fa-base-uncased") tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-fa-base-uncased") model = TFAutoModel.from_pretrained("HooshvareLab/bert-fa-base-uncased")
- PyTorch:
from transformers import AutoConfig, AutoTokenizer, AutoModel config = AutoConfig.from_pretrained("HooshvareLab/bert-fa-base-uncased") tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-fa-base-uncased") model = AutoModel.from_pretrained("HooshvareLab/bert-fa-base-uncased")
- TensorFlow:
- Tokenize Text: Use the tokenizer to preprocess input text.
- Run Inference: Use the model to process the tokenized input.
Cloud GPUs
For efficient computation, especially during training or large-scale inference, consider using cloud GPU services like AWS EC2 with GPU instances, Google Cloud Platform, or NVIDIA's GPU cloud.
License
The ParsBERT model is released under the Apache 2.0 license, allowing for broad usage and modification with adherence to the license terms.