tf-xlm-r-ner-40-lang

Maintained by jplu

Introduction

The model tf-xlm-r-ner-40-lang is a fine-tuned version of XLM-RoBERTa for Named Entity Recognition (NER) across 40 languages. It identifies entities such as locations (LOC), organizations (ORG), and persons (PER). The model uses the XLM-RoBERTa-base architecture and is trained on the WikiANN dataset as part of the XTREME benchmark.

Architecture

The model is based on XLM-RoBERTa, a transformer-based model described in "Unsupervised Cross-lingual Representation Learning at Scale" (arXiv:1911.02116). It supports multiple languages and is well suited to token classification tasks such as NER, where the model predicts one label per token.
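Token classification for NER typically uses BIO-style labels (e.g. B-PER, I-PER, O), which are then decoded into entity spans. A minimal sketch of that decoding step (the tag set and example sentence are illustrative, not taken from this model's output):

```python
def bio_to_spans(tokens, tags):
    """Collapse BIO tags into (entity_type, text) spans."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)
        else:  # "O" or an inconsistent I- tag closes the open span
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(words)) for etype, words in spans]

tokens = ["Barack", "Obama", "was", "born", "in", "Hawaii", "."]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O"]
print(bio_to_spans(tokens, tags))
# → [('PER', 'Barack Obama'), ('LOC', 'Hawaii')]
```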

Training

The model was fine-tuned on the WikiANN dataset, which contains annotated text in 40 languages. Performance metrics, such as precision, recall, and F1-score, are calculated for each language, with macro and micro averages provided. For example, in English, the model achieves an F1-score of 0.83.
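Micro averaging pools true positives, false positives, and false negatives across languages before computing a single F1, while macro averaging takes the unweighted mean of per-language F1 scores. A sketch with made-up counts (hypothetical numbers for illustration, not the model's actual statistics):

```python
# Hypothetical per-language counts: (true positives, false positives, false negatives)
counts = {"en": (830, 170, 170), "de": (780, 220, 210)}

def f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Macro F1: average the per-language scores.
macro = sum(f1(*c) for c in counts.values()) / len(counts)

# Micro F1: pool the counts first, then compute one score.
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro = f1(tp, fp, fn)

print(f"macro={macro:.3f} micro={micro:.3f}")
```

Micro averaging weights languages by their number of entities, so it is dominated by high-resource languages; macro averaging treats every language equally.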

Guide: Running Locally

Basic Steps

  1. Download Data: Obtain the dataset from the XTREME repository.
  2. Setup Environment: Ensure you have Python installed, along with the transformers library from Hugging Face.
  3. Run Training Script:
    cd examples/ner
    python run_tf_ner.py \
    --data_dir . \
    --labels ./labels.txt \
    --model_name_or_path jplu/tf-xlm-roberta-base \
    --output_dir model \
    --max_seq_length 128 \
    --num_train_epochs 2 \
    --per_gpu_train_batch_size 16 \
    --per_gpu_eval_batch_size 32 \
    --do_train \
    --do_eval \
    --logging_dir logs \
    --mode token-classification \
    --evaluate_during_training \
    --optimizer_name adamw
    
  4. Inference with Pipelines:
    from transformers import pipeline
    
    nlp_ner = pipeline(
        "ner",
        model="jplu/tf-xlm-r-ner-40-lang",
        tokenizer="jplu/tf-xlm-r-ner-40-lang",
        framework="tf"
    )
    
    text = "Barack Obama was born in Hawaii."
    print(nlp_ner(text))
    
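The pipeline returns one entry per (sub)word piece, each tagged with an entity label and score. A minimal post-processing sketch that merges consecutive pieces sharing the same label into whole entities (the raw output below is an assumed example shape, not captured from the model; XLM-R's SentencePiece tokenizer marks word starts with "▁"):

```python
# Assumed raw pipeline output, for illustration only.
raw = [
    {"word": "▁Barack", "entity": "PER", "score": 0.99},
    {"word": "▁Obama", "entity": "PER", "score": 0.99},
    {"word": "▁Hawaii", "entity": "LOC", "score": 0.98},
]

def group_entities(predictions):
    """Merge consecutive pieces that share the same entity label."""
    groups = []
    for pred in predictions:
        word = pred["word"].replace("▁", " ").strip()
        if groups and groups[-1]["entity"] == pred["entity"]:
            groups[-1]["word"] += " " + word
        else:
            groups.append({"entity": pred["entity"], "word": word})
    return groups

print(group_entities(raw))
# → [{'entity': 'PER', 'word': 'Barack Obama'}, {'entity': 'LOC', 'word': 'Hawaii'}]
```

Recent versions of transformers can perform this grouping directly via the pipeline's aggregation_strategy argument.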

Cloud GPUs

For improved performance and reduced training time, consider using cloud-based GPUs from providers such as AWS, Google Cloud, or Azure.

License

The model and associated code are available under the Apache License 2.0. Usage must comply with this license, ensuring attribution and permissible use.
