bert-fa-base-uncased

HooshvareLab

Introduction

ParsBERT is a monolingual language model for Persian language understanding, based on Google's BERT architecture. It is pre-trained on a diverse range of Persian corpora, including scientific texts, novels, and news articles, comprising over 3.9 million documents, 73 million sentences, and 1.3 billion words, using masked language modeling and next sentence prediction objectives. The model is primarily intended to be fine-tuned on downstream Persian NLP tasks.

Architecture

ParsBERT follows the Transformer-based architecture of BERT, tailored specifically for the Persian language. It uses its own Persian WordPiece vocabulary and is pre-trained on extensive Persian corpora, which improves its performance across a range of Persian NLP tasks.
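As a quick way to inspect these architectural details locally, the minimal sketch below loads the published configuration and prints its key hyperparameters; it assumes the transformers library is installed and only reads values from the hub, without asserting any specific numbers.

  from transformers import AutoConfig

  # Load the published configuration and inspect the BERT hyperparameters
  config = AutoConfig.from_pretrained("HooshvareLab/bert-fa-base-uncased")
  print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads, config.vocab_size)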

Training

The model was trained on a large collection of Persian texts sourced from public corpora such as Persian Wikipedia dumps and MirasText, as well as manually crawled data from a diverse set of websites covering topics such as science, lifestyle, and travel. Pre-processing combined POS tagging and WordPiece segmentation to standardize the data. After 300,000 training steps, the model reached a masked language modeling accuracy of 68.66% and a next sentence prediction accuracy of 100%.
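To see how the Persian WordPiece vocabulary segments text, the short sketch below tokenizes a sample sentence into subword tokens; the sentence is an illustrative example, not taken from the training data.

  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-fa-base-uncased")

  # Hypothetical Persian sentence ("I am interested in natural language processing.")
  tokens = tokenizer.tokenize("من به پردازش زبان طبیعی علاقه دارم.")
  print(tokens)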

Guide: Running Locally

Basic Steps

  1. Install Transformers Library: Install the transformers package (e.g. pip install transformers), along with PyTorch or TensorFlow for your chosen backend.
  2. Load Model and Tokenizer:
    • TensorFlow:
      from transformers import AutoConfig, AutoTokenizer, TFAutoModel
      
      config = AutoConfig.from_pretrained("HooshvareLab/bert-fa-base-uncased")
      tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-fa-base-uncased")
      model = TFAutoModel.from_pretrained("HooshvareLab/bert-fa-base-uncased")
      
    • PyTorch:
      from transformers import AutoConfig, AutoTokenizer, AutoModel
      
      config = AutoConfig.from_pretrained("HooshvareLab/bert-fa-base-uncased")
      tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-fa-base-uncased")
      model = AutoModel.from_pretrained("HooshvareLab/bert-fa-base-uncased")
      
  3. Tokenize Text: Use the tokenizer to preprocess the input text.
  4. Run Inference: Pass the tokenized input through the model (a short end-to-end sketch follows this list).
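As a minimal sketch of steps 3 and 4, the snippet below tokenizes a hypothetical Persian sentence and runs a forward pass with the PyTorch model to obtain contextual embeddings; the input sentence and variable names are illustrative, not taken from the model card.

  import torch
  from transformers import AutoTokenizer, AutoModel

  tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-fa-base-uncased")
  model = AutoModel.from_pretrained("HooshvareLab/bert-fa-base-uncased")

  # Hypothetical Persian input ("The Persian language is beautiful.")
  text = "زبان فارسی زیباست."

  # Step 3: tokenize into input IDs and an attention mask
  inputs = tokenizer(text, return_tensors="pt")

  # Step 4: run inference without tracking gradients
  with torch.no_grad():
      outputs = model(**inputs)

  # Contextual embeddings per token: shape (1, sequence_length, hidden_size)
  print(outputs.last_hidden_state.shape)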

Cloud GPUs

For efficient computation, especially during training or large-scale inference, consider cloud GPU services such as AWS EC2 GPU instances, Google Cloud Platform, or NVIDIA GPU Cloud (NGC).

License

The ParsBERT model is released under the Apache 2.0 license, allowing for broad usage and modification with adherence to the license terms.
