SBERT-CHINESE-GENERAL-V2
Introduction
SBERT-CHINESE-GENERAL-V2 is a model developed by DMetaSoul for semantic similarity tasks in Chinese. It builds on the bert-base-chinese model and is trained on SimCLUE, a large-scale Chinese semantic similarity dataset. The model targets general-purpose semantic matching scenarios and generalizes well across a variety of tasks.
Architecture
SBERT-CHINESE-GENERAL-V2 is based on the BERT architecture, specifically the bert-base-chinese variant. It is optimized for sentence similarity and feature extraction, supporting applications such as semantic search and text embedding inference.
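As a sketch of how the feature-extraction and semantic-search use cases fit together: the model maps sentences to dense vectors, and candidates are ranked by cosine similarity to a query. The corpus, query, and ranking below are illustrative only; they are not part of the model card.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('DMetaSoul/sbert-chinese-general-v2')

# Hypothetical corpus (passport-application FAQs plus a distractor) and query.
corpus = ["如何办理护照?", "今天天气怎么样?", "护照办理需要哪些材料?"]
query = "办护照要带什么证件?"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank corpus sentences by cosine similarity to the query.
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
for sentence, score in sorted(zip(corpus, scores.tolist()), key=lambda p: -p[1]):
    print(f"{score:.4f}  {sentence}")
```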
Training
This model was trained on SimCLUE, a large-scale Chinese semantic similarity dataset, and has been evaluated on several public semantic matching benchmarks, where it shows better performance and generalization than its predecessor, SBERT-CHINESE-GENERAL-V1.
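The exact training configuration is not documented here, so the following is only a minimal sketch of how a sentence-pair model of this kind is commonly fine-tuned with sentence-transformers; the pooling choice, CosineSimilarityLoss, placeholder pairs, and hyperparameters are assumptions, not DMetaSoul's recipe.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Build a BERT encoder with mean pooling on top (assumed architecture).
word_embedding_model = models.Transformer('bert-base-chinese')
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Placeholder pairs with similarity labels in [0, 1]; real training would
# iterate over SimCLUE instead.
train_examples = [
    InputExample(texts=["我的儿子在哪儿?", "我的儿子在哪里?"], label=0.95),
    InputExample(texts=["今天天气很好", "我想买一台电脑"], label=0.05),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```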
Guide: Running Locally
To use SBERT-CHINESE-GENERAL-V2 locally, follow these steps:
- Install Sentence-Transformers: install the sentence-transformers library:

  ```bash
  pip install -U sentence-transformers
  ```
- Using Sentence-Transformers: load the model and extract text embeddings with the following code (a sketch for scoring these embeddings follows the list):

  ```python
  from sentence_transformers import SentenceTransformer

  # Two near-paraphrases: "My son! he suddenly cried, where is my son?"
  sentences = ["我的儿子!他猛然间喊道,我的儿子在哪儿?", "我的儿子呢!他突然喊道,我的儿子在哪里?"]

  model = SentenceTransformer('DMetaSoul/sbert-chinese-general-v2')
  embeddings = model.encode(sentences)
  print(embeddings)
  ```
- Using Hugging Face Transformers: alternatively, use the Hugging Face transformers library and apply mean pooling over the token embeddings yourself:

  ```python
  import torch
  from transformers import AutoTokenizer, AutoModel

  def mean_pooling(model_output, attention_mask):
      # Average token embeddings, masking out padding positions.
      token_embeddings = model_output[0]
      input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
      return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

  sentences = ["我的儿子!他猛然间喊道,我的儿子在哪儿?", "我的儿子呢!他突然喊道,我的儿子在哪里?"]

  tokenizer = AutoTokenizer.from_pretrained('DMetaSoul/sbert-chinese-general-v2')
  model = AutoModel.from_pretrained('DMetaSoul/sbert-chinese-general-v2')

  encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
  with torch.no_grad():
      model_output = model(**encoded_input)

  sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
  print("Sentence embeddings:")
  print(sentence_embeddings)
  ```
- Cloud GPUs: for optimal performance, consider a cloud GPU service such as AWS EC2, Google Cloud Platform, or Azure to handle the model's computational demands; the device-selection sketch after this list shows how to target a GPU.
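Whichever loading path you use, the resulting embeddings can be scored the same way. Here is a short sketch that compares the two example sentences with cosine similarity (util.cos_sim is a sentence-transformers helper; how to threshold the score is application-specific):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('DMetaSoul/sbert-chinese-general-v2')
sentences = ["我的儿子!他猛然间喊道,我的儿子在哪儿?", "我的儿子呢!他突然喊道,我的儿子在哪里?"]

embeddings = model.encode(sentences, convert_to_tensor=True)
# A cosine similarity near 1.0 indicates near-paraphrases.
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(f"cosine similarity: {similarity.item():.4f}")
```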
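On a machine with a CUDA GPU (local or cloud), sentence-transformers can place the model on the device at load time. Below is a device-selection sketch, assuming a standard PyTorch CUDA setup; the batch size is a placeholder to tune for your hardware:

```python
import torch
from sentence_transformers import SentenceTransformer

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer('DMetaSoul/sbert-chinese-general-v2', device=device)

# Batched encoding is where GPU throughput pays off.
embeddings = model.encode(["我的儿子在哪儿?"] * 1024, batch_size=256, show_progress_bar=True)
print(embeddings.shape)
```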
License
The model is available under the terms specified by DMetaSoul, and users should refer to their official documentation or contact the developers for licensing details.