roberta-fa-zwnj-base-ner

HooshvareLab

Introduction

The roberta-fa-zwnj-base-ner model is a fine-tuned version of the RoBERTa model for Named Entity Recognition (NER) in Persian text. It is trained on the ARMAN, PEYMA, and WikiANN datasets and covers ten entity types: Date, Event, Facility, Location, Money, Organization, Percent, Person, Product, and Time.
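Token classifiers like this one typically encode the entity classes with IOB2 tags. As a rough sketch (the actual label names and ordering live in the model's config and may use abbreviations rather than the full names below):

```python
# Sketch of an IOB2 label set for the ten entity types listed above.
# The exact label strings in the model's config may differ (e.g. abbreviations).
entity_types = [
    "Date", "Event", "Facility", "Location", "Money",
    "Organization", "Percent", "Person", "Product", "Time",
]
# "O" for non-entity tokens, plus a B- (begin) and I- (inside) tag per type.
labels = ["O"] + [f"{prefix}-{t}" for t in entity_types for prefix in ("B", "I")]
print(len(labels))  # 1 + 2 * 10 = 21 labels
```

With ten entity types this yields 21 labels in total, which is what the model's classification head predicts per token.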

Architecture

This model is based on the RoBERTa architecture, a robustly optimized BERT pretraining approach. It is compatible with PyTorch, TensorFlow, and JAX, and supports token classification tasks.

Training

The model was trained on a combined dataset comprising ARMAN, PEYMA, and WikiANN, with the entity types above labeled. The corpus consists of Persian text records split into training, validation, and test sets. Reported evaluation results include an overall accuracy of 99.48% and an F1 score of 95.50%, with per-entity precision and recall metrics.
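For reference, span-level precision, recall, and F1 for NER follow the standard definitions. A minimal sketch, using illustrative placeholder counts (not the model's actual numbers):

```python
# Span-level NER metrics from true positives, false positives, false negatives.
# These counts are hypothetical placeholders for illustration only.
tp, fp, fn = 95, 4, 5

precision = tp / (tp + fp)            # fraction of predicted spans that are correct
recall = tp / (tp + fn)               # fraction of gold spans that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The reported 95.50% F1 is the same harmonic-mean computation applied to the model's actual span-level counts on the test set.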

Guide: Running Locally

Basic Steps

  1. Install Transformers Library

    pip install transformers
    
  2. Load Model and Tokenizer

    from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
    # For TensorFlow, import the TF variant instead:
    # from transformers import TFAutoModelForTokenClassification
    
    model_name_or_path = "HooshvareLab/roberta-fa-zwnj-base-ner"
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
    model = AutoModelForTokenClassification.from_pretrained(model_name_or_path)  # PyTorch
    # model = TFAutoModelForTokenClassification.from_pretrained(model_name_or_path)  # TensorFlow
    
    nlp = pipeline("ner", model=model, tokenizer=tokenizer)
    example = "در سال ۲۰۱۳ درگذشت و آندرتیکر و کین برای او مراسم یادبود گرفتند."  # "He died in 2013, and The Undertaker and Kane held a memorial for him."
    ner_results = nlp(example)
    print(ner_results)
    
  3. Run Predictions
    Use the provided example code to perform NER on Persian text.
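The pipeline above returns one prediction per token (word piece), so multi-token entities arrive as separate B-/I- tagged pieces. Grouping them into spans can be sketched as follows; the sample input is a hypothetical token-level output, not an actual run of the model, and the label strings are illustrative:

```python
# Sketch: merge consecutive B-/I- tagged tokens from an NER pipeline's
# token-level output into (entity_type, text) spans.
def group_entities(tokens):
    """Group IOB2-tagged tokens into entity spans."""
    spans, current = [], None
    for tok in tokens:
        tag, _, etype = tok["entity"].partition("-")
        if tag == "B":
            if current:
                spans.append(current)
            current = (etype, tok["word"])
        elif tag == "I" and current and current[0] == etype:
            # Continuation of the current entity.
            current = (etype, current[1] + " " + tok["word"])
        else:
            # "O" or an inconsistent I- tag closes any open span.
            if current:
                spans.append(current)
            current = (etype, tok["word"]) if tag == "I" else None
    if current:
        spans.append(current)
    return spans

# Hypothetical token-level output for the example sentence above.
sample = [
    {"entity": "B-person", "word": "آندرتیکر"},
    {"entity": "B-person", "word": "کین"},
    {"entity": "B-date", "word": "۲۰۱۳"},
]
print(group_entities(sample))
```

In practice, transformers can do this grouping for you: construct the pipeline with `pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")` to receive merged entity spans directly.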

Suggested Cloud GPUs

For efficient model execution, consider using cloud-based GPU services such as AWS EC2, Google Cloud Platform, or Microsoft Azure.

License

For questions or problems, open an issue on the ParsNER GitHub repository. Licensing details are included in the repository documentation.
