sbert-chinese-general-v1
Introduction
The sbert-chinese-general-v1 model by DMetaSoul is a Chinese Sentence-BERT model built on bert-base-chinese. It was trained on semantic similarity datasets including NLI, PAWS-X, PKU-Paraphrase-Bank, and STS, and is suited to general-purpose semantic matching tasks such as text feature extraction, text vector clustering, and semantic search.
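The sentence embeddings it produces can be fed to any off-the-shelf clustering algorithm. Below is a minimal sketch of the text vector clustering use case, assuming scikit-learn is installed; the toy corpus and cluster count are illustrative, not from the model card.

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Hypothetical toy corpus: "The weather is nice today", "It's sunny outside",
# "I like reading", "Reading is my hobby"
sentences = ["今天天气很好", "外面阳光明媚", "我喜欢读书", "阅读是我的爱好"]

model = SentenceTransformer('DMetaSoul/sbert-chinese-general-v1')
embeddings = model.encode(sentences)

# Cluster the sentence vectors; n_clusters=2 is an arbitrary choice for this toy corpus
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(embeddings)
print(labels)  # semantically similar sentences should land in the same cluster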
Architecture
sbert-chinese-general-v1 is built on the BERT architecture, specifically the bert-base-chinese model. It is optimized for semantic similarity and feature extraction tasks, although it may exhibit overfitting in some scenarios.
Training
This model was trained on semantic similarity datasets, including NLI, PAWS-X, PKU-Paraphrase-Bank, and STS. It performs well on the Chinese-STS task but may transfer less well to other tasks because of the risk of overfitting to its training distribution.
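One way to sanity-check the model on STS-style data is to rank-correlate predicted cosine similarities against gold scores. The sketch below is purely illustrative: the sentence pairs and gold scores are made up, not taken from the actual Chinese-STS data.

from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, util

# Illustrative pairs with made-up gold similarity scores (not real STS data)
pairs = [("他在跑步", "他在慢跑"),    # "He is running" / "He is jogging"
         ("他在跑步", "她在做饭"),    # "He is running" / "She is cooking"
         ("猫在睡觉", "小猫睡着了")]  # "The cat is sleeping" / "The kitten fell asleep"
gold = [4.5, 0.5, 4.0]

model = SentenceTransformer('DMetaSoul/sbert-chinese-general-v1')
emb1 = model.encode([a for a, _ in pairs])
emb2 = model.encode([b for _, b in pairs])

# Cosine similarity per pair, rank-correlated with the gold scores
pred = [util.cos_sim(a, b).item() for a, b in zip(emb1, emb2)]
print(spearmanr(pred, gold).correlation)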
Guide: Running Locally
Using Sentence-Transformers
- Install the sentence-transformers package:

pip install -U sentence-transformers
- Load the model and extract embeddings:

from sentence_transformers import SentenceTransformer

# Two near-paraphrase Chinese sentences ("My son! he suddenly shouted, where is my son?")
sentences = ["我的儿子!他猛然间喊道,我的儿子在哪儿?",
             "我的儿子呢!他突然喊道,我的儿子在哪里?"]

model = SentenceTransformer('DMetaSoul/sbert-chinese-general-v1')
embeddings = model.encode(sentences)
print(embeddings)
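For the semantic matching and search use cases mentioned in the introduction, the package also ships a cosine similarity helper; a short follow-up to the snippet above:

from sentence_transformers import util

# Cosine similarity between the two embeddings computed above
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(similarity)  # a 1x1 tensor; values close to 1 indicate near-paraphrases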
Using Hugging Face Transformers
- Load the model and extract embeddings:

from transformers import AutoTokenizer, AutoModel
import torch

# Mean pooling: average the token embeddings, weighted by the attention mask
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element holds all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ["我的儿子!他猛然间喊道,我的儿子在哪儿?",
             "我的儿子呢!他突然喊道,我的儿子在哪里?"]

tokenizer = AutoTokenizer.from_pretrained('DMetaSoul/sbert-chinese-general-v1')
model = AutoModel.from_pretrained('DMetaSoul/sbert-chinese-general-v1')

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
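To turn these pooled vectors into cosine similarities with plain PyTorch, they can be L2-normalized first; a small sketch continuing from the code above:

import torch.nn.functional as F

# L2-normalize so that a dot product equals cosine similarity
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
print(normalized[0] @ normalized[1])  # cosine similarity of the two example sentences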
Cloud GPUs
For improved performance, consider using cloud GPUs provided by services such as AWS, Google Cloud, or Azure.
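On such an instance, the model can be pointed at the GPU explicitly; a minimal sketch using standard PyTorch device selection (nothing here is specific to this model):

import torch
from sentence_transformers import SentenceTransformer

# Use the GPU when available, otherwise fall back to CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer('DMetaSoul/sbert-chinese-general-v1', device=device)
embeddings = model.encode(["我的儿子在哪里?"], device=device)  # "Where is my son?"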
License
The model is licensed under the Apache-2.0 license.