Erlangshen-Roberta-330M-Similarity
IDEA-CCNL

Introduction
Erlangshen-Roberta-330M-Similarity is a fine-tuned version of the Chinese RoBERTa-wwm-ext-large model developed for similarity tasks. It has been trained on 20 Chinese paraphrase datasets, amounting to 2,773,880 samples, to enhance its performance in natural language understanding (NLU) applications.
Architecture
The model is based on the chinese-roberta-wwm-ext-large architecture. It is designed for tasks requiring semantic similarity assessment between Chinese text pairs. The model features 330 million parameters, making it suitable for a variety of NLU tasks.
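To make the 330M figure concrete, here is a minimal sketch (assuming `transformers` and `torch` are installed and the checkpoint can be downloaded) that loads the model and counts its parameters:

```python
# Minimal sketch: load the checkpoint and count its parameters to confirm
# the roughly 330M figure quoted above.
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    'IDEA-CCNL/Erlangshen-Roberta-330M-Similarity'
)
n_params = sum(p.numel() for p in model.parameters())
print(f'{n_params / 1e6:.0f}M parameters')
```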
Training
The model has been fine-tuned on a substantial dataset of 2,773,880 samples from 20 Chinese paraphrase datasets. This extensive training has resulted in improved performance metrics across various benchmarks, including BQ, BUSTM, and AFQMC, demonstrating its effectiveness in identifying semantic similarities in Chinese text.
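For readers curious about what such a fine-tune looks like mechanically, the sketch below shows a single training step of a BERT sequence classifier on toy paraphrase pairs. The in-memory dataset, hyperparameters, and the `hfl/chinese-roberta-wwm-ext-large` base checkpoint are illustrative assumptions, not the exact recipe IDEA-CCNL used.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Base checkpoint and hyperparameters are assumptions for illustration only.
tokenizer = BertTokenizer.from_pretrained('hfl/chinese-roberta-wwm-ext-large')
model = BertForSequenceClassification.from_pretrained(
    'hfl/chinese-roberta-wwm-ext-large', num_labels=2)

# Toy paraphrase pairs: label 1 = similar, label 0 = dissimilar.
pairs = [('今天天气很好', '今天天气不错', 1),
         ('今天的饭不好吃', '今天心情不好', 0)]

# Encode each (texta, textb) pair as one [CLS] a [SEP] b [SEP] sequence.
enc = tokenizer([a for a, _, _ in pairs], [b for _, b, _ in pairs],
                padding=True, truncation=True, return_tensors='pt')
labels = torch.tensor([label for _, _, label in pairs])

# One optimization step: cross-entropy loss over the two labels.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
loss = model(**enc, labels=labels).loss
loss.backward()
optimizer.step()
print(f'loss: {loss.item():.4f}')
```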
Guide: Running Locally
To use Erlangshen-Roberta-330M-Similarity locally, follow these steps:
1. Install required libraries: make sure the `transformers` library and `torch` are installed in your Python environment (`pip install transformers torch` if needed).
2. Load the model and tokenizer:

   ```python
   from transformers import BertForSequenceClassification, BertTokenizer
   import torch

   tokenizer = BertTokenizer.from_pretrained('IDEA-CCNL/Erlangshen-Roberta-330M-Similarity')
   model = BertForSequenceClassification.from_pretrained('IDEA-CCNL/Erlangshen-Roberta-330M-Similarity')
   ```
3. Prepare input data:

   ```python
   # Example sentence pair: "Today's food is not tasty" / "I'm in a bad mood today".
   texta = '今天的饭不好吃'
   textb = '今天心情不好'
   ```
4. Make predictions:

   ```python
   # Encode the pair as one sequence, run the classifier, and print the
   # softmax probability distribution over the two similarity labels.
   output = model(torch.tensor([tokenizer.encode(texta, textb)]))
   print(torch.nn.functional.softmax(output.logits, dim=-1))
   ```
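An equivalent, slightly more idiomatic inference variant (my own restatement, not from the original card) builds the batch with the tokenizer's `__call__` and disables gradient tracking. Which softmax index corresponds to "similar" should be checked against the model's `id2label` mapping rather than assumed:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('IDEA-CCNL/Erlangshen-Roberta-330M-Similarity')
model = BertForSequenceClassification.from_pretrained('IDEA-CCNL/Erlangshen-Roberta-330M-Similarity')
model.eval()  # inference mode: disables dropout

inputs = tokenizer('今天的饭不好吃', '今天心情不好', return_tensors='pt')
with torch.no_grad():
    probs = torch.nn.functional.softmax(model(**inputs).logits, dim=-1)
# Check model.config.id2label to interpret the two columns of probs.
print(probs, model.config.id2label)
```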
For improved performance, consider using cloud-based GPUs from providers like AWS, Google Cloud, or Azure.
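Continuing from the snippet above, moving the model and inputs onto a GPU follows the standard PyTorch pattern and is not specific to this model:

```python
import torch

# Select a CUDA device when available, otherwise fall back to CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
with torch.no_grad():
    probs = torch.nn.functional.softmax(model(**inputs).logits, dim=-1)
print(probs)
```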
License
The Erlangshen-Roberta-330M-Similarity model is released under the Apache 2.0 License, which permits both personal and commercial use, provided the license text and copyright notices are retained.