KoSimCSE-bert

BM-K

Introduction

KoSimCSE-BERT is a Korean sentence-embedding model built on the BERT architecture. It can be used for feature extraction and for text-embedding inference. The project covers both using the pre-trained model and training new ones.

Architecture

KoSimCSE-BERT is based on the BERT architecture and targets Korean text. It loads through the Transformers library, with weights available in PyTorch and Safetensors formats. The model is used for feature extraction and suits tasks involving semantic textual similarity.

Training

KoSimCSE-BERT reports high scores on semantic textual similarity benchmarks, outperforming models such as KoSBERT and KoSentenceBART. It uses multi-task training to improve performance, and its results are reported for measures based on cosine similarity, Euclidean distance, and Manhattan distance.
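For reference, the three measures named above can be computed for a pair of embedding vectors as in the minimal sketch below. The function names and the toy vectors `u` and `v` are made up for illustration; this is not the evaluation code behind the reported results.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line (L2) distance between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan_distance(u, v):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))

# Toy vectors (hypothetical): v points the same way as u but is twice as long
u = [1.0, 2.0, 2.0]
v = [2.0, 4.0, 4.0]

cos = cosine_similarity(u, v)
l2 = euclidean_distance(u, v)
l1 = manhattan_distance(u, v)
```

In this toy case the cosine similarity is at its maximum because the vectors share a direction, while both distances are non-zero because the lengths differ. That is why the three measures can rank sentence pairs differently.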

Guide: Running Locally

  1. Environment Setup:

    • Install PyTorch and the Transformers library.
    • Run the model scripts with Python.
  2. Code Execution:

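    import torch
    from transformers import AutoModel, AutoTokenizer
    
    # Load the pre-trained model and its tokenizer from the Hugging Face Hub
    model = AutoModel.from_pretrained('BM-K/KoSimCSE-bert')
    tokenizer = AutoTokenizer.from_pretrained('BM-K/KoSimCSE-bert')
    
    # Two related sentences and one unrelated sentence:
    #   "A cheetah chases prey across a field."
    #   "A cheetah is running behind its prey."
    #   "A monkey is playing drums."
    sentences = ['치타가 들판을 가로 질러 먹이를 쫓는다.',
                 '치타 한 마리가 먹이 뒤에서 달리고 있다.',
                 '원숭이 한 마리가 드럼을 연주한다.']
    
    # Tokenize all sentences as one padded batch of PyTorch tensors
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    # With return_dict=False the model returns a tuple; the first element holds
    # the per-token hidden states, shape (batch, sequence length, hidden size)
    embeddings, _ = model(**inputs, return_dict=False)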
    
  3. Suggestions for Cloud GPUs:

    • Use cloud services such as AWS, Google Cloud, or Azure to leverage GPU capabilities for faster model inference and training.
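The per-token output from step 2 can be turned into pairwise similarity scores. The sketch below assumes the first-token ([CLS]) vector serves as the sentence embedding, which the text above does not state. It also picks a GPU when one is available, as step 3 suggests. The helper name `cls_cosine_scores` and the stand-in tensor `demo` are hypothetical; `demo` replaces real model output so the snippet runs without downloading the model.

```python
import torch

def cls_cosine_scores(last_hidden_state: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity, scaled to 0-100, between first-token vectors.

    Assumes input of shape (batch, sequence length, hidden size).
    """
    cls = last_hidden_state[:, 0]                          # (batch, hidden)
    cls = torch.nn.functional.normalize(cls, p=2, dim=1)   # unit-length rows
    return cls @ cls.T * 100                               # (batch, batch)

# Use a GPU when one is present, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for model output: 3 "sentences", 2 tokens each, hidden size 2
demo = torch.tensor([
    [[1.0, 0.0], [0.5, 0.5]],
    [[2.0, 0.0], [0.1, 0.9]],
    [[0.0, 3.0], [0.7, 0.3]],
]).to(device)

scores = cls_cosine_scores(demo)
```

With the real model, the same helper would be called as `cls_cosine_scores(embeddings)` after moving `model` and `inputs` to `device`. The two cheetah sentences would then be expected to score higher with each other than with the monkey sentence.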

License

For license information, refer to the GitHub repository.
