roberta base finetuned cluener2020 chinese

uer

Introduction

This Chinese RoBERTa-Base model is fine-tuned for named entity recognition (NER) on the CLUENER2020 dataset. It was fine-tuned with UER-py. It can also be fine-tuned with TencentPretrain, a framework that additionally covers larger models and multimodal pre-training.

Architecture

The model uses the RoBERTa architecture with a token-classification head for Chinese text. It is initialized from the pre-trained chinese_roberta_L-12_H-768 model (12 layers, hidden size 768) and fine-tuned with a sequence length of 512.

Training

Training Data

The model is fine-tuned on CLUENER2020, a fine-grained NER dataset for Chinese. Only the training set is used for fine-tuning.
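To make the token-classification setup concrete, here is a minimal sketch of turning a span-annotated record into one BIO tag per character. It assumes the JSON-lines layout in which CLUENER2020 is commonly distributed (a "text" field plus a "label" field mapping entity type to mentions with inclusive [start, end] character positions). The record itself is made up for illustration, and `spans_to_bio` is a hypothetical helper, not part of UER-py.

```python
import json


def spans_to_bio(text, label):
    """Convert span annotations into one BIO tag per character.

    Assumed layout: {entity_type: {surface_form: [[start, end], ...]}},
    with an inclusive end index.
    """
    tags = ["O"] * len(text)
    for entity_type, mentions in label.items():
        for _surface, positions in mentions.items():
            for start, end in positions:
                tags[start] = f"B-{entity_type}"
                for i in range(start + 1, end + 1):
                    tags[i] = f"I-{entity_type}"
    return tags


# Made-up record in the assumed layout (not taken from the dataset).
line = (
    '{"text": "张三在腾讯工作", '
    '"label": {"name": {"张三": [[0, 1]]}, "company": {"腾讯": [[3, 4]]}}}'
)
record = json.loads(line)
bio_tags = spans_to_bio(record["text"], record["label"])

for char, tag in zip(record["text"], bio_tags):
    print(char, tag)
```

Each character receives exactly one tag, which is the per-token target a token-classification model is trained to predict.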

Training Procedure

The model is fine-tuned on Tencent Cloud for five epochs with a sequence length of 512, a batch size of 32, and a learning rate of 3e-5. At the end of each epoch the model is evaluated on the development set, and the checkpoint with the best development-set performance is saved. The fine-tuned model is then converted to Hugging Face's format. The fine-tuning command is:

python3 finetune/run_ner.py --pretrained_model_path models/cluecorpussmall_roberta_base_seq512_model.bin-250000 \
                            --vocab_path models/google_zh_vocab.txt \
                            --train_path datasets/cluener2020/train.tsv \
                            --dev_path datasets/cluener2020/dev.tsv \
                            --label2id_path datasets/cluener2020/label2id.json \
                            --output_model_path models/cluener2020_ner_model.bin \
                            --learning_rate 3e-5 --epochs_num 5 --batch_size 32 --seq_length 512
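
The "evaluate after each epoch, keep the best model" behaviour described above can be sketched as a small loop. This is an illustrative outline only, not the code in run_ner.py: the three callables are hypothetical stand-ins, and the development-set scores are made up.

```python
def fine_tune(train_one_epoch, evaluate_on_dev, save_checkpoint, epochs_num=5):
    """Train for epochs_num epochs and save only when the dev score improves.

    Returns (best_epoch, best_score), with epochs numbered from 1.
    """
    best_epoch, best_score = None, float("-inf")
    for epoch in range(1, epochs_num + 1):
        train_one_epoch(epoch)
        score = evaluate_on_dev(epoch)
        if score > best_score:
            best_epoch, best_score = epoch, score
            save_checkpoint(epoch)  # overwrites the previously saved best model
    return best_epoch, best_score


# Stand-ins so the sketch runs without a model; the scores are invented.
fake_dev_scores = {1: 0.71, 2: 0.76, 3: 0.79, 4: 0.78, 5: 0.77}
saved_epochs = []

best_epoch, best_score = fine_tune(
    train_one_epoch=lambda epoch: None,
    evaluate_on_dev=lambda epoch: fake_dev_scores[epoch],
    save_checkpoint=saved_epochs.append,
)
print(best_epoch, best_score, saved_epochs)
```

With this selection rule, later epochs that score lower on the development set never replace the saved model.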

Guide: Running Locally

To use the model locally:

  1. Install the Hugging Face Transformers library.
  2. Load the model and tokenizer using AutoModelForTokenClassification and AutoTokenizer.
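  3. Run the NER pipeline on your text.

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Download (or load from the local cache) the fine-tuned model and its tokenizer.
model_name = 'uer/roberta-base-finetuned-cluener2020-chinese'
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Build a token-classification pipeline and tag a sample sentence.
ner = pipeline('ner', model=model, tokenizer=tokenizer)
result = ner("江苏警方通报特斯拉冲进店铺")
print(result)  # one dict per tagged token: entity, score, index, word, start, end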
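
The pipeline returns one prediction per tagged token, so a common follow-up step is merging consecutive B-/I- tags into whole entities. The sketch below shows one way to do that in plain Python. `group_entities` is a hypothetical helper, and the sample predictions are hand-written to mimic the pipeline's output keys (`entity`, `score`, `index`, `word`, `start`, `end`); they are not actual model output.

```python
def group_entities(text, predictions):
    """Merge token-level B-/I- predictions into (entity_type, surface) pairs."""
    spans = []
    current = None
    for pred in predictions:
        if pred["entity"] == "O":
            # An explicit outside tag closes any open span.
            if current is not None:
                spans.append(current)
                current = None
            continue
        prefix, _, entity_type = pred["entity"].partition("-")
        continues = (
            prefix == "I"
            and current is not None
            and current["type"] == entity_type
            and pred["start"] == current["end"]
        )
        if continues:
            current["end"] = pred["end"]
        else:
            if current is not None:
                spans.append(current)
            current = {"type": entity_type, "start": pred["start"], "end": pred["end"]}
    if current is not None:
        spans.append(current)
    return [(s["type"], text[s["start"]:s["end"]]) for s in spans]


text = "江苏警方通报特斯拉冲进店铺"
# Hand-written sample in the pipeline's output shape; scores are invented.
sample_predictions = [
    {"entity": "B-address", "score": 0.9, "index": 1, "word": "江", "start": 0, "end": 1},
    {"entity": "I-address", "score": 0.9, "index": 2, "word": "苏", "start": 1, "end": 2},
    {"entity": "B-company", "score": 0.9, "index": 7, "word": "特", "start": 6, "end": 7},
    {"entity": "I-company", "score": 0.9, "index": 8, "word": "斯", "start": 7, "end": 8},
    {"entity": "I-company", "score": 0.9, "index": 9, "word": "拉", "start": 8, "end": 9},
]
entities = group_entities(text, sample_predictions)
print(entities)
```

The Transformers pipeline can also do this kind of grouping itself through its `aggregation_strategy` argument (for example `aggregation_strategy="simple"`); the manual version is mainly useful when custom merging rules are needed.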

Inference is faster on a GPU. If no local GPU is available, cloud GPU instances from providers such as AWS, Google Cloud, or Azure are an option.

License

The use of this model and code is governed by the respective licenses of UER-py and TencentPretrain. Please refer to their repositories for specific licensing terms.
