KR-SBERT-V40K-klueNLI-augSTS

snunlp

Introduction

KR-SBERT-V40K-klueNLI-augSTS is a Korean sentence-transformers model developed by SNUNLP. It maps sentences and paragraphs to a 768-dimensional dense vector space, and the resulting embeddings can be used for tasks such as clustering and semantic search.
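
To make the semantic search use case concrete, here is a minimal sketch of ranking candidate sentences by cosine similarity to a query. The helper names cosine_similarity and rank_by_similarity are hypothetical, and the toy 3-dimensional vectors are an assumption standing in for the 768-dimensional embeddings the model would return.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, corpus_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy vectors (assumption): real inputs would come from the model's encoder.
query = [1.0, 0.0, 0.0]
corpus = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]
ranking = rank_by_similarity(query, corpus)
print(ranking[0][0])  # index of the corpus vector closest to the query
```

Cosine similarity is used here because it compares direction rather than magnitude, which is a common choice for sentence embeddings; other distance measures could be substituted.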

Architecture

The model architecture is based on SentenceTransformer, which includes:

  • A Transformer model (BertModel) with a maximum sequence length of 128 and lowercasing disabled.
  • A pooling layer that performs mean pooling on token embeddings to generate sentence embeddings.
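
The pooling step can be illustrated with a small, dependency-free sketch for a single sentence. The function name masked_mean_pool and the toy 2-dimensional token vectors are assumptions made for illustration; the model itself operates on batched tensors.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sentence, skipping padded positions."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:  # mask value 1 marks a real token, 0 marks padding
            count += 1
            for j, value in enumerate(vec):
                summed[j] += value
    denom = max(count, 1e-9)  # avoid division by zero for all-padding input
    return [s / denom for s in summed]

# Two real tokens followed by one padding token (toy values).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(tokens, mask)
print(pooled)
```

The padding vector is ignored because its mask entry is 0, so only the first two tokens contribute to the average.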

Training

The model was evaluated using the Sentence Embeddings Benchmark and achieved an accuracy of 0.8628. It is designed for sentence similarity tasks and belongs to the same KR-SBERT family as KR-SBERT-Medium-NLI-STS and KR-SBERT-V40K-NLI-augSTS.

Guide: Running Locally

  1. Install Dependencies:

    • For Sentence-Transformers:
      pip install -U sentence-transformers
      
    • For HuggingFace Transformers:
      pip install torch transformers
      
  2. Using Sentence-Transformers:

    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer('snunlp/KR-SBERT-V40K-klueNLI-augSTS')
    sentences = ["This is an example sentence", "Each sentence is converted"]
    embeddings = model.encode(sentences)
    print(embeddings)
    
  3. Using HuggingFace Transformers:

    from transformers import AutoTokenizer, AutoModel
    import torch

    # Mean pooling: average token embeddings, using the attention mask to ignore padding
    def mean_pooling(model_output, attention_mask):
        token_embeddings = model_output[0]  # first element holds all token embeddings
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    # Load the tokenizer and model from the HuggingFace Hub
    tokenizer = AutoTokenizer.from_pretrained('snunlp/KR-SBERT-V40K-klueNLI-augSTS')
    model = AutoModel.from_pretrained('snunlp/KR-SBERT-V40K-klueNLI-augSTS')
    sentences = ['This is an example sentence', 'Each sentence is converted']
    # Tokenize the sentences into padded, truncated PyTorch tensors
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings without tracking gradients
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Pool token embeddings into one vector per sentence
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    print("Sentence embeddings:")
    print(sentence_embeddings)
    
  4. Cloud GPUs: Consider using cloud-based GPUs from providers like AWS, Google Cloud, or Azure for faster processing, especially when handling large datasets or performing extensive computations.
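
For large datasets, one option is to encode sentences in fixed-size batches so that results can be saved or streamed incrementally. The sketch below is an assumption-laden illustration: encode_in_batches and fake_encode are hypothetical names, and fake_encode is a stand-in so the example runs without downloading the model. In practice a callable such as model.encode could be passed as encode_fn.

```python
def encode_in_batches(sentences, encode_fn, batch_size=32):
    """Call encode_fn on successive slices and collect one vector per sentence."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    embeddings = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        embeddings.extend(encode_fn(batch))  # one vector per sentence in the batch
    return embeddings

# Stand-in encoder (hypothetical): returns a 1-dimensional "embedding" per sentence.
def fake_encode(batch):
    return [[float(len(s))] for s in batch]

vectors = encode_in_batches(["a", "bb", "ccc"], fake_encode, batch_size=2)
print(len(vectors))
```

Note that SentenceTransformer.encode already accepts a batch_size argument, so an explicit loop like this is mainly useful when intermediate results need to be written out between batches.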

License

The KR-SBERT model is a publicly available resource published by SNUNLP. For citation and further details, refer to the KR-SBERT GitHub repository.
