SBERT-CHINESE-GENERAL-V2

DMetaSoul

Introduction

SBERT-CHINESE-GENERAL-V2 is a model developed by DMetaSoul for semantic similarity tasks in Chinese. It builds on the bert-base-chinese model and is trained on SimCLUE, a large-scale Chinese semantic similarity dataset. The model targets general-purpose semantic matching scenarios and generalizes better than its predecessor across a range of tasks.

Architecture

SBERT-CHINESE-GENERAL-V2 is based on the BERT architecture, specifically the bert-base-chinese checkpoint. It is optimized for sentence similarity and feature extraction, supporting applications such as semantic search and text embedding inference.
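
As a sketch of how the model fits into a semantic search pipeline (a minimal example; the corpus and query strings below are illustrative, and the sentence-transformers library is assumed to be installed):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer('DMetaSoul/sbert-chinese-general-v2')

    # Illustrative corpus ("Beijing is the capital of China", "The weather is
    # nice today", "Machine learning is a branch of AI") and query
    # ("What is the capital of China?")
    corpus = ["北京是中国的首都", "今天天气很好", "机器学习是人工智能的一个分支"]
    query = "中国的首都是哪里?"

    # Encode into dense vectors and rank the corpus by cosine similarity to the query
    corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
    query_embedding = model.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)

    for hit in hits[0]:
        print(corpus[hit['corpus_id']], hit['score'])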

Training

This model was trained on SimCLUE, a large-scale Chinese semantic similarity dataset. Evaluated across several public semantic matching benchmarks, it shows improved performance and generalization over its predecessor, SBERT-CHINESE-GENERAL-V1.

Guide: Running Locally

To use SBERT-CHINESE-GENERAL-V2 locally, follow these steps:

  1. Install Sentence-Transformers:

    Install or upgrade the sentence-transformers library:

    pip install -U sentence-transformers
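
    To confirm the install succeeded (an optional sanity check), print the installed version:

    import sentence_transformers
    print(sentence_transformers.__version__)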
    
  2. Using Sentence-Transformers:

    Load the model and extract text embeddings with the following code:

    from sentence_transformers import SentenceTransformer
    
    sentences = ["我的儿子!他猛然间喊道,我的儿子在哪儿?", "我的儿子呢!他突然喊道,我的儿子在哪里?"]
    model = SentenceTransformer('DMetaSoul/sbert-chinese-general-v2')
    embeddings = model.encode(sentences)
    print(embeddings)
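
    To turn the two embeddings into a similarity score, compare them with cosine similarity. This short follow-up assumes a recent sentence-transformers release, where the helper is exposed as util.cos_sim (older versions name it util.pytorch_cos_sim):

    from sentence_transformers import util

    # The two sentences are near-paraphrases, so the score should be close to 1.0
    score = util.cos_sim(embeddings[0], embeddings[1])
    print(score)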
    
  3. Using Hugging Face Transformers:

    Alternatively, use the Hugging Face transformers library:

    from transformers import AutoTokenizer, AutoModel
    import torch
    
    def mean_pooling(model_output, attention_mask):
        # The first element of model_output holds the per-token embeddings
        token_embeddings = model_output[0]
        # Expand the attention mask to the embedding dimension so padding tokens are zeroed out
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        # Average over real tokens only; the clamp avoids division by zero
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    
    # Two near-paraphrases: "My son! he shouted suddenly, where is my son?"
    sentences = ["我的儿子!他猛然间喊道,我的儿子在哪儿?", "我的儿子呢!他突然喊道,我的儿子在哪里?"]
    tokenizer = AutoTokenizer.from_pretrained('DMetaSoul/sbert-chinese-general-v2')
    model = AutoModel.from_pretrained('DMetaSoul/sbert-chinese-general-v2')
    
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**encoded_input)
    
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    print("Sentence embeddings:")
    print(sentence_embeddings)
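
    The pooled embeddings can likewise be turned into a similarity score. A common pattern (a sketch, not part of the original snippet) is to L2-normalize the vectors so that their dot product equals cosine similarity:

    import torch.nn.functional as F

    # After L2 normalization, the dot product of two vectors is their cosine similarity
    normalized = F.normalize(sentence_embeddings, p=2, dim=1)
    print((normalized[0] @ normalized[1]).item())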
    
  4. Cloud GPUs (optional):

    For encoding large corpora, consider a cloud GPU service such as AWS EC2, Google Cloud Platform, or Azure; the model runs on CPU, but batch inference is substantially faster on a GPU.
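
    With sentence-transformers, the model can be placed on a GPU directly via the device argument (a minimal sketch; 'cuda' assumes an NVIDIA GPU is available):

    from sentence_transformers import SentenceTransformer

    # Load the model onto the first CUDA device and encode in batches
    model = SentenceTransformer('DMetaSoul/sbert-chinese-general-v2', device='cuda')
    embeddings = model.encode(["我的儿子!他猛然间喊道,我的儿子在哪儿?"], batch_size=32)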

License

The model is available under the terms specified by DMetaSoul, and users should refer to their official documentation or contact the developers for licensing details.
