roberta-ko-small-tsdae

smartmind

Introduction

The roberta-ko-small-tsdae model by smartmind is a sentence-transformers model that maps sentences and paragraphs to 256-dimensional dense vectors. It is designed for tasks such as clustering and semantic search and is tailored specifically to the Korean language.

Architecture

The model is a small Korean RoBERTa pretrained with TSDAE (Transformer-based Sequential Denoising Auto-Encoder), described in arXiv:2104.06979. Its architecture matches lassl/roberta-ko-small but uses a different tokenizer. The model consists of a Transformer module followed by a Pooling module configured for CLS-token pooling.
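
The module stack and embedding size can be verified directly once sentence-transformers is installed; a minimal sketch (output formatting may vary with the library version):

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('smartmind/roberta-ko-small-tsdae')

    # Prints the module stack: a Transformer followed by a Pooling module
    print(model)

    # The Pooling module uses the CLS token, yielding 256-dimensional vectors
    print(model.get_sentence_embedding_dimension())  # expected: 256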

Evaluation

The model was evaluated on the KLUE STS benchmark and achieved good correlation scores without any fine-tuning. Reported metrics include Pearson and Spearman correlations computed over cosine similarity as well as Euclidean and Manhattan distances.
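
The same style of STS evaluation can be run with the library's EmbeddingSimilarityEvaluator. The sketch below substitutes a few invented Korean pairs with illustrative gold scores for the actual KLUE STS data:

    from sentence_transformers import SentenceTransformer
    from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

    model = SentenceTransformer('smartmind/roberta-ko-small-tsdae')

    # Hypothetical sentence pairs with gold similarity scores scaled to [0, 1]
    sentences1 = ["오늘 날씨가 좋다", "그는 책을 읽고 있다", "고양이가 소파에서 잔다"]
    sentences2 = ["오늘은 날씨가 맑다", "그는 소설을 읽는 중이다", "주가가 크게 하락했다"]
    scores = [0.9, 0.75, 0.1]

    # Reports Pearson/Spearman correlations for cosine, Euclidean, and Manhattan
    evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, scores, name='sts-demo')
    print(evaluator(model))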

Guide: Running Locally

Basic Steps

  1. Install Sentence-Transformers:

    pip install -U sentence-transformers
    
  2. Load and Use the Model:

    from sentence_transformers import SentenceTransformer

    # Download the model from the Hugging Face Hub and encode a few sentences
    model = SentenceTransformer('smartmind/roberta-ko-small-tsdae')
    sentences = ["This is an example sentence", "Each sentence is converted"]
    embeddings = model.encode(sentences)
    print(embeddings)
    
  3. Using Hugging Face Transformers:
    Install the transformers library and use the following code to load the model without sentence-transformers:

    from transformers import AutoTokenizer, AutoModel
    import torch

    tokenizer = AutoTokenizer.from_pretrained('smartmind/roberta-ko-small-tsdae')
    model = AutoModel.from_pretrained('smartmind/roberta-ko-small-tsdae')

    sentences = ["This is an example sentence", "Each sentence is converted"]

    # Tokenize the sentences and run a forward pass without gradients
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**encoded_input)

    # CLS pooling: take the hidden state of the first ([CLS]) token
    def cls_pooling(model_output):
        return model_output[0][:, 0]

    sentence_embeddings = cls_pooling(model_output)
    print(sentence_embeddings)
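
With embeddings in hand, semantic search reduces to cosine similarity between vectors. A minimal sketch using the util helper from sentence-transformers (the query and corpus are made-up examples):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer('smartmind/roberta-ko-small-tsdae')

    query = "날씨가 어때?"  # hypothetical query
    corpus = ["오늘은 맑고 따뜻하다", "주식 시장이 마감했다"]

    query_emb = model.encode(query, convert_to_tensor=True)
    corpus_emb = model.encode(corpus, convert_to_tensor=True)

    # Cosine similarity between the query and each corpus sentence
    print(util.cos_sim(query_emb, corpus_emb))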
    

Suggested Cloud GPUs

For optimal performance, especially with large datasets or real-time applications, consider running the model on cloud GPU services such as AWS EC2, Google Cloud Platform, or Azure.
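
On such instances, sentence-transformers can be pointed at the GPU explicitly; a sketch assuming standard PyTorch device naming:

    import torch
    from sentence_transformers import SentenceTransformer

    # Use a CUDA device when available, otherwise fall back to CPU
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model = SentenceTransformer('smartmind/roberta-ko-small-tsdae', device=device)

    # Larger batches make better use of GPU throughput
    embeddings = model.encode(["예시 문장입니다"], batch_size=64)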

License

This model is licensed under the MIT License, allowing for wide usage and distribution.
