ko-sroberta-nli

jhgan

Introduction

ko-sroberta-nli is a sentence-transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space. It is well suited to tasks such as clustering and semantic search on Korean-language text.

Architecture

The model architecture consists of:

  • Transformer: A RobertaModel encoder with a maximum sequence length of 128 and no lowercasing applied.
  • Pooling: Mean pooling aggregates the token embeddings (word embedding dimension 768) into a single sentence embedding; a sketch of this composition follows the list.
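
In the sentence-transformers API, this composition can be sketched roughly as follows; the module arguments are reconstructed from the description above rather than taken from the model's published configuration files:

    from sentence_transformers import SentenceTransformer, models

    # Transformer module: RoBERTa encoder, max sequence length 128, no lowercasing
    word_embedding_model = models.Transformer(
        'jhgan/ko-sroberta-nli', max_seq_length=128, do_lower_case=False
    )

    # Pooling module: mean-pool token embeddings into a 768-dimensional sentence vector
    pooling_model = models.Pooling(
        word_embedding_model.get_word_embedding_dimension(),  # 768
        pooling_mode_mean_tokens=True,
    )

    model = SentenceTransformer(modules=[word_embedding_model, pooling_model])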

Training

The model was trained using:

  • DataLoader: Custom NoDuplicatesDataLoader with a batch size of 64.
  • Loss Function: MultipleNegativesRankingLoss, using cosine similarity with a scale of 20.0.
  • Training Parameters:
    • One epoch
    • Evaluation steps every 1000 steps
    • Optimizer: AdamW with a learning rate of 2e-5
    • Scheduler: WarmupLinear
    • Weight decay of 0.01
    • Warmup steps set to 889
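
A minimal training sketch with the sentence-transformers API, using the hyperparameters listed above; the training pairs and the starting checkpoint below are hypothetical placeholders, not the actual training data or base model:

    import torch
    from sentence_transformers import SentenceTransformer, InputExample, losses, util
    from sentence_transformers.datasets import NoDuplicatesDataLoader

    # Stand-in for the original base encoder (presumably a Korean RoBERTa checkpoint)
    model = SentenceTransformer('jhgan/ko-sroberta-nli')

    # Hypothetical (premise, entailed hypothesis) pairs standing in for an NLI corpus;
    # in practice the list needs at least `batch_size` unique pairs
    train_examples = [
        InputExample(texts=["A man is eating food.", "A man is having a meal."]),
        # ... more positive pairs ...
    ]

    train_dataloader = NoDuplicatesDataLoader(train_examples, batch_size=64)
    train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0, similarity_fct=util.cos_sim)

    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=1,
        evaluation_steps=1000,
        warmup_steps=889,
        scheduler='WarmupLinear',
        optimizer_class=torch.optim.AdamW,
        optimizer_params={'lr': 2e-5},
        weight_decay=0.01,
    )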

Guide: Running Locally

Basic Steps

  1. Install Dependencies:

    pip install -U sentence-transformers
    
  2. Usage with Sentence-Transformers:

    from sentence_transformers import SentenceTransformer

    # Korean example sentences: "Hello?", "This is a BERT model for Korean sentence embeddings."
    sentences = ["안녕하세요?", "한국어 문장 임베딩을 위한 버트 모델입니다."]

    model = SentenceTransformer('jhgan/ko-sroberta-nli')
    embeddings = model.encode(sentences)  # one 768-dimensional vector per sentence
    print(embeddings)
    
  3. Usage with Hugging Face Transformers:

    from transformers import AutoTokenizer, AutoModel
    import torch

    # Mean pooling: average token embeddings, weighting by the attention mask so that
    # padding tokens do not contribute to the sentence embedding
    def mean_pooling(model_output, attention_mask):
        token_embeddings = model_output[0]  # first element holds all token embeddings
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    sentences = ['This is an example sentence', 'Each sentence is converted']

    # Load the tokenizer and model from the Hugging Face Hub
    tokenizer = AutoTokenizer.from_pretrained('jhgan/ko-sroberta-nli')
    model = AutoModel.from_pretrained('jhgan/ko-sroberta-nli')

    # Tokenize the input sentences
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Apply mean pooling to obtain fixed-size sentence embeddings
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    print("Sentence embeddings:")
    print(sentence_embeddings)
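
Whichever path is used, the resulting embeddings can be compared with cosine similarity for semantic search or clustering. A short follow-up using the sentence-transformers utility (it accepts both PyTorch tensors and NumPy arrays):

    from sentence_transformers import util

    # Pairwise cosine similarities between the sentence embeddings computed above
    # (works equally with `embeddings` from step 2 or `sentence_embeddings` from step 3)
    cosine_scores = util.cos_sim(sentence_embeddings, sentence_embeddings)
    print(cosine_scores)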
    

Suggested Cloud GPUs

For faster encoding or fine-tuning, consider cloud GPU offerings such as AWS EC2, Google Cloud Platform, or Microsoft Azure.

License

Details about the license are not specified in the provided content. Please refer to the official Hugging Face repository or contact the model authors for licensing information.
