sbert-chinese-general-v1

DMetaSoul

Introduction

The SBERT-CHINESE-GENERAL-V1 model by DMetaSoul is a Chinese version of Sentence-BERT based on the bert-base-chinese model. It is trained on datasets such as NLI, PAWS-X, PKU-Paraphrase-Bank, and STS and is suitable for general semantic matching tasks like text feature extraction, text vector clustering, and semantic search.

Architecture

SBERT-CHINESE-GENERAL-V1 is built on the BERT architecture, specifically the bert-base-chinese model, following the Sentence-BERT approach of producing fixed-size sentence embeddings. It is optimized for semantic similarity and feature-extraction tasks, although it shows some overfitting to its training data, so performance may degrade on out-of-domain tasks.

Training

This model was trained using semantic similarity datasets, including NLI, PAWS-X, PKU-Paraphrase-Bank, and STS. It performs well on the Chinese-STS task but may not be optimal for other tasks due to overfitting risks.

Guide: Running Locally

Using Sentence-Transformers

  1. Install the sentence-transformers package:

    pip install -U sentence-transformers
    
  2. Load the model and extract embeddings:

    from sentence_transformers import SentenceTransformer
    
    # Two near-paraphrases, roughly: "My son!" he shouted suddenly, "where is my son?"
    sentences = ["我的儿子!他猛然间喊道,我的儿子在哪儿?", "我的儿子呢!他突然喊道,我的儿子在哪里?"]
    
    # Download the model from the Hugging Face Hub and compute one embedding per sentence
    model = SentenceTransformer('DMetaSoul/sbert-chinese-general-v1')
    embeddings = model.encode(sentences)
    print(embeddings)
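
The resulting embeddings can be compared directly for semantic matching. A minimal sketch, reusing the embeddings array from the snippet above together with the cos_sim helper shipped with sentence-transformers:

    from sentence_transformers import util
    
    # Cosine similarity between the two sentence embeddings; values near 1.0
    # indicate near-paraphrases, as with this sentence pair
    similarity = util.cos_sim(embeddings[0], embeddings[1])
    print(similarity)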
    

Using Hugging Face Transformers

  1. Load the model and extract embeddings:

    from transformers import AutoTokenizer, AutoModel
    import torch
    
    # Mean pooling: average the token embeddings, using the attention mask
    # so that padding tokens are ignored
    def mean_pooling(model_output, attention_mask):
        token_embeddings = model_output[0]  # first element holds all token embeddings
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    
    # Two near-paraphrases, roughly: "My son!" he shouted suddenly, "where is my son?"
    sentences = ["我的儿子!他猛然间喊道,我的儿子在哪儿?", "我的儿子呢!他突然喊道,我的儿子在哪里?"]
    
    tokenizer = AutoTokenizer.from_pretrained('DMetaSoul/sbert-chinese-general-v1')
    model = AutoModel.from_pretrained('DMetaSoul/sbert-chinese-general-v1')
    
    # Tokenize and run a forward pass without gradient tracking
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**encoded_input)
    
    # Pool the token embeddings into fixed-size sentence embeddings
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    print("Sentence embeddings:")
    print(sentence_embeddings)
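
The pooled embeddings can then be scored with cosine similarity. A minimal sketch, reusing the sentence_embeddings tensor from the snippet above:

    import torch.nn.functional as F
    
    # Cosine similarity between the two pooled sentence embeddings;
    # values near 1.0 indicate near-paraphrases
    similarity = F.cosine_similarity(sentence_embeddings[0], sentence_embeddings[1], dim=0)
    print(similarity)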
    

Cloud GPUs

For improved performance, consider using cloud GPUs provided by services such as AWS, Google Cloud, or Azure.
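
As a minimal sketch of GPU usage with sentence-transformers (the device selection below is generic, not specific to this model):

    import torch
    from sentence_transformers import SentenceTransformer
    
    # Run on a CUDA GPU when available, otherwise fall back to CPU
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model = SentenceTransformer('DMetaSoul/sbert-chinese-general-v1', device=device)
    embeddings = model.encode(["我的儿子!他猛然间喊道,我的儿子在哪儿?"])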

License

The model is licensed under the Apache-2.0 license.
