TF-XLM-R-NER-40-LANG
jplu
Introduction
TF-XLM-R-NER-40-LANG is a fine-tuned version of XLM-RoBERTa for Named Entity Recognition (NER) across 40 languages. It identifies three entity types: locations (LOC), organizations (ORG), and persons (PER). The model uses the XLM-RoBERTa-base architecture and is trained on the WikiANN dataset as distributed with the XTREME benchmark.
Architecture
The model is based on XLM-RoBERTa, a multilingual transformer encoder described in the paper arXiv:1911.02116. For NER it is fine-tuned as a token classifier, meaning that it assigns one label to each input token.
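To make the idea of token classification concrete, here is a minimal sketch of the final labeling step: each token gets a vector of logits, one per label, and the most probable label is chosen. The label list (including the "O" label for non-entities), the function names, and the logit values are assumptions made for illustration and are not taken from the model itself.

```python
import math
from typing import List, Tuple

# Assumed label set: the three entity types named in this document plus an
# "O" (outside) label, which is an assumption and not stated in the source.
LABELS = ["O", "LOC", "ORG", "PER"]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one token's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_tokens(token_logits: List[List[float]]) -> List[Tuple[str, float]]:
    """Pick the most probable label for each token independently."""
    results = []
    for logits in token_logits:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((LABELS[best], probs[best]))
    return results


# Hypothetical logits for three tokens; a real model would compute these
# from the transformer's per-token hidden states.
example_logits = [
    [0.1, 0.2, 0.3, 4.0],  # highest score at index 3 -> "PER"
    [5.0, 0.1, 0.2, 0.3],  # highest score at index 0 -> "O"
    [0.2, 3.5, 0.4, 0.1],  # highest score at index 1 -> "LOC"
]
predictions = classify_tokens(example_logits)
print(predictions)
```

Each token is labeled on its own here, so turning token labels into whole entity mentions is a separate post-processing step.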
Training
The model was fine-tuned on the WikiANN dataset, which contains annotated text in 40 languages. Precision, recall, and F1-score are reported for each language, together with macro and micro averages. In English, for example, the model achieves an F1-score of 0.83.
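The difference between macro and micro averaging can be sketched as follows. This is a minimal illustration that assumes the averages are taken over the three entity types within one language. The counts are made up for the example and are not the model's results, and the function names are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical counts per entity type: (true positives, false positives,
# false negatives). Illustrative numbers only, not the model's results.
COUNTS: Dict[str, Tuple[int, int, int]] = {
    "LOC": (80, 20, 20),
    "ORG": (90, 10, 30),
    "PER": (95, 5, 5),
}


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1), using 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Average the per-type F1 scores, weighting every type equally."""
    scores = [prf(*c)[2] for c in counts.values()]
    return sum(scores) / len(scores)


def micro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Pool the counts over all types first, then compute a single F1."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return prf(tp, fp, fn)[2]


for name, c in COUNTS.items():
    p, r, f = prf(*c)
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
print(f"macro F1={macro_f1(COUNTS):.3f} micro F1={micro_f1(COUNTS):.3f}")
```

Macro averaging treats every entity type equally, while micro averaging gives more weight to types with more examples, so the two can diverge when the classes are imbalanced.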
Guide: Running Locally
Basic Steps
- Download Data: Obtain the dataset from the XTREME repository.
- Setup Environment: Ensure you have Python installed, along with the transformers library from Hugging Face.
- Run Training Script:
    cd examples/ner
    python run_tf_ner.py \
      --data_dir . \
      --labels ./labels.txt \
      --model_name_or_path jplu/tf-xlm-roberta-base \
      --output_dir model \
      --max-seq-length 128 \
      --num_train_epochs 2 \
      --per_gpu_train_batch_size 16 \
      --per_gpu_eval_batch_size 32 \
      --do_train \
      --do_eval \
      --logging_dir logs \
      --mode token-classification \
      --evaluate_during_training \
      --optimizer_name adamw
- Inference with Pipelines:
    from transformers import pipeline

    nlp_ner = pipeline(
        "ner",
        model="jplu/tf-xlm-r-ner-40-lang",
        tokenizer=('jplu/tf-xlm-r-ner-40-lang', {"use_fast": True}),
        framework="tf"
    )

    text = "Barack Obama was born in Hawaii."
    print(nlp_ner(text))
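The inference step produces predictions per token, so a name split into several subword pieces comes back as several entries. The sketch below shows one way to merge them into whole entities. It assumes the pipeline returns a list of dictionaries with `word`, `entity`, `score`, and `index` keys, that labels carry no B-/I- prefixes, and that "▁" marks the start of a word. The function name `group_entities` and the sample data are hypothetical, not real model output.

```python
from typing import Dict, List

# Hypothetical token-level predictions in the shape assumed above.
# The words, labels, and scores are illustrative, not real model output.
SAMPLE = [
    {"word": "▁Barack", "entity": "PER", "score": 0.99, "index": 1},
    {"word": "▁Obama", "entity": "PER", "score": 0.98, "index": 2},
    {"word": "▁Hawa", "entity": "LOC", "score": 0.97, "index": 6},
    {"word": "ii", "entity": "LOC", "score": 0.96, "index": 7},
]


def group_entities(tokens: List[Dict]) -> List[Dict]:
    """Merge adjacent tokens that share a label into one entity span."""
    spans: List[Dict] = []
    for tok in tokens:
        last = spans[-1] if spans else None
        adjacent = last is not None and tok["index"] == last["end_index"] + 1
        if adjacent and tok["entity"] == last["entity"]:
            # Same label and directly following the previous token: extend.
            last["pieces"].append(tok["word"])
            last["scores"].append(tok["score"])
            last["end_index"] = tok["index"]
        else:
            # Different label or a gap in positions: start a new span.
            spans.append({
                "entity": tok["entity"],
                "pieces": [tok["word"]],
                "scores": [tok["score"]],
                "end_index": tok["index"],
            })
    return [
        {
            "entity": s["entity"],
            # Turn the assumed word-start marker into a space, then trim.
            "text": "".join(s["pieces"]).replace("▁", " ").strip(),
            # Report the least confident piece as the span's score.
            "score": min(s["scores"]),
        }
        for s in spans
    ]


print(group_entities(SAMPLE))
```

Because the assumed labels have no B-/I- prefixes, two distinct entities of the same type that sit directly next to each other would be merged into one span. Taking the minimum score is a conservative design choice; averaging the piece scores is a reasonable alternative.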
Cloud GPUs
For improved performance and reduced training time, consider using cloud-based GPUs from providers such as AWS, Google Cloud, or Azure.
License
The model and associated code are available under the Apache License 2.0. Use and redistribution must comply with its terms, including retaining the required copyright and license notices.