sup-SimCSE-VietNamese-phobert-base Model

Introduction

SimeCSE_Vietnamese is a sentence similarity model designed for Vietnamese. It builds on SimCSE, a contrastive learning approach, to generate sentence embeddings. Input sentences are encoded with a pre-trained language model such as PhoBERT, and the approach works with both labeled and unlabeled data.

Architecture

The architecture is based on PhoBERT, a pre-trained language model for Vietnamese. The supervised version of SimeCSE_Vietnamese, sup-SimCSE-VietNamese-phobert-base, has 135 million parameters and is intended to improve the quality of sentence embeddings for Vietnamese text.

Training

SimeCSE_Vietnamese refines sentence embeddings with a contrastive learning strategy: embeddings of semantically related sentences are pulled together, while embeddings of unrelated sentences are pushed apart. The pre-training approach follows SimCSE, starts from the PhoBERT encoder, and aims for more robust performance. Training can draw on both labeled and unlabeled datasets to improve the quality of the sentence representations.
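
To make the contrastive objective concrete, here is a minimal sketch of an in-batch contrastive (InfoNCE-style) loss of the kind SimCSE is built around. It is an illustration, not the authors' training code: the function names, the toy vectors, and the temperature of 0.05 are assumptions, and plain Python lists stand in for real PhoBERT embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(anchors, positives, temperature=0.05):
    """In-batch contrastive loss: anchors[i] should match positives[i];
    every other positives[j] in the batch acts as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the whole batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the correct positive.
        total += log_denominator - logits[i]
    return total / len(anchors)

# Toy 2-d "embeddings" (hypothetical values, for illustration only).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]    # each positive is close to its anchor
shuffled = [[0.1, 0.9], [0.9, 0.1]]   # positives swapped between anchors

loss_aligned = contrastive_loss(anchors, aligned)
loss_shuffled = contrastive_loss(anchors, shuffled)
```

The loss is small when each anchor is closest to its own positive and grows when the pairs are mismatched, which is the signal that pushes the encoder toward useful sentence embeddings.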

Guide: Running Locally

Installation

To use SimeCSE_Vietnamese with either the Sentence-Transformers or the Transformers library, follow these steps:

For Sentence-Transformers:

Install dependencies:

pip install -U sentence-transformers
pip install pyvi

Use the model:

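from sentence_transformers import SentenceTransformer
from pyvi.ViTokenizer import tokenize

# Load the supervised model from the Hugging Face Hub.
model = SentenceTransformer('VoVanPhuc/sup-SimCSE-VietNamese-phobert-base')
sentences = ['Your Vietnamese sentences here']
# PhoBERT expects word-segmented input; pyvi joins the syllables of a word with "_".
sentences = [tokenize(sentence) for sentence in sentences]
# encode() returns one embedding vector per sentence.
embeddings = model.encode(sentences)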
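
Once embeddings are available, sentence similarity is typically scored with cosine similarity. The sketch below is one dependency-free way to do that; the helper names and the toy vectors are hypothetical, and in practice the rows of the `embeddings` array returned by `model.encode` would take the place of the stand-in vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vec, candidate_vecs):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, c) for c in candidate_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Stand-in vectors (hypothetical); real sentence embeddings have many more dimensions.
toy_embeddings = [
    [1.0, 0.0, 0.0],   # query sentence
    [0.8, 0.6, 0.0],   # candidate 0: points in a similar direction
    [0.0, 0.0, 1.0],   # candidate 1: orthogonal to the query
]
ranking = rank_by_similarity(toy_embeddings[0], toy_embeddings[1:])
```

Here `ranking` lists the candidate indices in order of decreasing similarity to the query, so the first entry is the closest sentence.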

For Transformers:

Install dependencies:

pip install -U transformers
pip install pyvi

Use the model:

import torch
from transformers import AutoModel, AutoTokenizer
from pyvi.ViTokenizer import tokenize

tokenizer = AutoTokenizer.from_pretrained("VoVanPhuc/sup-SimCSE-VietNamese-phobert-base")
model = AutoModel.from_pretrained("VoVanPhuc/sup-SimCSE-VietNamese-phobert-base")

sentences = ['Your Vietnamese sentences here']
# PhoBERT expects word-segmented input, so segment with pyvi before tokenizing.
sentences = [tokenize(sentence) for sentence in sentences]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Gradients are not needed for inference; pooler_output holds one vector per sentence.
with torch.no_grad():
    embeddings = model(**inputs, output_hidden_states=True, return_dict=True).pooler_output
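
The Transformers example tokenizes and encodes every sentence in a single call, which can exhaust memory for large inputs. One common workaround is to encode in fixed-size batches. The sketch below shows the idea with hypothetical helpers: `encode_batch` is a placeholder for whatever function wraps the tokenizer and model call, and the default batch size of 32 is an arbitrary assumption.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def encode_in_batches(sentences, encode_batch, batch_size=32):
    """Apply `encode_batch` (hypothetical: list of sentences -> list of vectors)
    one batch at a time and concatenate the results in order."""
    vectors = []
    for batch in batched(sentences, batch_size):
        vectors.extend(encode_batch(batch))
    return vectors

def fake_encode(batch):
    """Stand-in encoder for illustration only: one 1-d vector per sentence."""
    return [[float(len(sentence))] for sentence in batch]

result = encode_in_batches(["a", "bb", "ccc"], fake_encode, batch_size=2)
```

The output order matches the input order, so `result[i]` is still the vector for `sentences[i]` regardless of the batch size chosen.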

Cloud GPUs

For faster inference, consider running the model on a cloud GPU, such as those available through Google Colab, AWS, or Azure. A Colab quick-start notebook is also available.

License

Use of SimeCSE_Vietnamese and related resources is subject to the licenses specified by the authors and by the repositories from which the models and datasets are obtained. Always check the licensing terms of the specific model or dataset you are using.
