sbert base chinese nli

uer

Introduction

The SBERT-BASE-CHINESE-NLI model is a pre-trained sentence embedding model for Chinese sentence similarity tasks, built on the Sentence-BERT architecture. It was pre-trained with the UER-py framework and can also be pre-trained with TencentPretrain.

Architecture

The model is based on the Sentence-BERT architecture, utilizing a pre-trained chinese_roberta_L-12_H-768 model. It is designed for feature extraction and sentence similarity tasks, using cosine similarity to compare sentence embeddings.
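
To make the comparison step concrete, the following is a minimal sketch of cosine similarity written with NumPy. The function name `cosine_similarity` and the toy 3-dimensional vectors are illustrative assumptions, not part of the model or of UER-py; real sentence embeddings from the model are much longer vectors.

```python
import numpy as np

def cosine_similarity(u, v):
    """Return the cosine of the angle between two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(u, v) / denom)

# Toy vectors standing in for sentence embeddings (hypothetical values).
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a
c = np.array([-3.0, 0.0, 1.0])  # orthogonal to a

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```

A score near 1 means the two vectors point in nearly the same direction, and a score near 0 means they are unrelated in direction. Vector length does not affect the score, which is why `a` and `b` are treated as identical here.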

Training

The SBERT-BASE-CHINESE-NLI model was fine-tuned on the ChineseTextualInference dataset using the UER-py framework, for five epochs with a sequence length of 128. At the end of each epoch, the model was saved if it had achieved the best performance so far on the development set.
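
The save-on-best rule described above can be sketched as follows. This is an illustration of the general pattern only, not UER-py's actual code: the function `select_best_epochs` is hypothetical, and the development-set scores are made-up numbers, not results reported for this model.

```python
def select_best_epochs(dev_scores):
    """Return the (epoch, score) pairs at which a checkpoint would be saved.

    A checkpoint is saved only when the development-set score at the end
    of an epoch is strictly better than every earlier score.
    """
    best = float("-inf")
    saved = []
    for epoch, score in enumerate(dev_scores, start=1):
        if score > best:
            best = score
            saved.append((epoch, score))
    return saved

# Hypothetical dev-set scores for five epochs (illustrative values only).
dev_scores = [0.70, 0.74, 0.73, 0.76, 0.75]
saved = select_best_epochs(dev_scores)
print(saved)
```

With these made-up scores, checkpoints would be written after epochs 1, 2, and 4, so the file left on disk at the end holds the epoch-4 weights rather than the final epoch's.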

python3 finetune/run_classifier_siamese.py --pretrained_model_path models/cluecorpussmall_roberta_base_seq512_model.bin-250000 \
                                           --vocab_path models/google_zh_vocab.txt \
                                           --config_path models/sbert/base_config.json \
                                           --train_path datasets/ChineseTextualInference/train.tsv \
                                           --dev_path datasets/ChineseTextualInference/dev.tsv \
                                           --learning_rate 5e-5 --epochs_num 5 --batch_size 64

The fine-tuned model is then converted to Hugging Face's format:

python3 scripts/convert_sbert_from_uer_to_huggingface.py --input_model_path models/finetuned_model.bin \
                                                         --output_model_path pytorch_model.bin \
                                                         --layers_num 12

Guide: Running Locally

To run the SBERT-BASE-CHINESE-NLI model locally, follow these steps:

  1. Install the sentence-transformers library:

    pip install sentence-transformers
    
  2. Load the model and encode sentences:

    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer('uer/sbert-base-chinese-nli')
    sentences = ['那个人很开心', '那个人非常开心']
    sentence_embeddings = model.encode(sentences)
    
  3. Calculate cosine similarity:

    from sklearn.metrics.pairwise import paired_cosine_distances
    cosine_score = 1 - paired_cosine_distances([sentence_embeddings[0]], [sentence_embeddings[1]])
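
When more than two sentences are involved, the same idea extends to ranking several candidates against one query. The sketch below assumes the embeddings are available as NumPy arrays; the function name `rank_by_similarity` and the 2-dimensional toy vectors are hypothetical stand-ins for real model output.

```python
import numpy as np

def rank_by_similarity(query_vec, candidate_vecs):
    """Rank candidates by cosine similarity to the query.

    Returns (order, scores): `order` lists candidate indices from most
    to least similar, and `scores[i]` is the similarity of candidate i.
    """
    q = np.asarray(query_vec, dtype=float)
    c = np.asarray(candidate_vecs, dtype=float)
    # Normalize to unit length so the dot product equals cosine similarity.
    q = q / np.linalg.norm(q)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)
    return order, scores

# Toy vectors standing in for sentence embeddings (hypothetical values).
query = np.array([1.0, 0.0])
candidates = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
order, scores = rank_by_similarity(query, candidates)
print(order, scores)
```

With the real model, the first row of `model.encode(sentences)` could serve as `query` and the remaining rows as `candidates`, since `encode` returns a NumPy array by default.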
    

The model runs on CPU, but encoding large numbers of sentences is considerably faster on a GPU. Cloud GPU services such as AWS, Google Cloud, or Azure are one option.

License

The SBERT-BASE-CHINESE-NLI model is released under the Apache License 2.0.
