sup-SimCSE-VietNamese-phobert-base
VoVanPhuc
Introduction
SimCSE_Vietnamese is a sentence-similarity model designed for Vietnamese. It builds on SimCSE, using contrastive learning to produce sentence embeddings. Input sentences are encoded with a pre-trained language model such as PhoBERT, and the approach can be trained on both labeled and unlabeled data.
Architecture
The architecture is based on PhoBERT, a pre-trained language model for Vietnamese. The supervised version of SimCSE_Vietnamese, sup-SimCSE-VietNamese-phobert-base, has 135 million parameters and is intended to improve the quality of Vietnamese sentence embeddings.
Training
SimCSE_Vietnamese refines sentence embeddings with a contrastive learning strategy applied on top of the pre-trained PhoBERT encoder: embeddings of related sentences are pulled together, and embeddings of unrelated sentences are pushed apart. Training can draw on both labeled and unlabeled datasets to improve the quality of the sentence representations.
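To make the contrastive objective concrete, the following is a minimal pure-Python sketch of an InfoNCE-style loss with in-batch negatives, the general form used by SimCSE. It is not the authors' training code: the function names, the toy vectors, and the temperature of 0.05 are assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss: anchor i should match positive i, while the
    other positives in the batch act as negatives."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy 2-d "embeddings" (hypothetical values, not model output).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # each positive is close to its anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]  # positives paired with the wrong anchor

loss_aligned = contrastive_loss(anchors, aligned)
loss_swapped = contrastive_loss(anchors, swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when each anchor is most similar to its own positive and grows when another sentence in the batch is a closer match, which is the pressure that shapes the embedding space during training.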
Guide: Running Locally
Installation
To use SimCSE_Vietnamese with either Sentence-Transformers or Transformers, follow these steps:
- For Sentence-Transformers:
  - Install dependencies:
        pip install -U sentence-transformers
        pip install pyvi
  - Use the model:
        from sentence_transformers import SentenceTransformer
        from pyvi.ViTokenizer import tokenize
        # Load the supervised SimCSE model
        model = SentenceTransformer('VoVanPhuc/sup-SimCSE-VietNamese-phobert-base')
        sentences = ['Your Vietnamese sentences here']
        # PhoBERT expects word-segmented input, so segment with pyvi first
        sentences = [tokenize(sentence) for sentence in sentences]
        # One embedding vector per input sentence
        embeddings = model.encode(sentences)
- For Transformers:
  - Install dependencies:
        pip install -U transformers
        pip install pyvi
  - Use the model:
        import torch
        from transformers import AutoModel, AutoTokenizer
        from pyvi.ViTokenizer import tokenize
        # Load the tokenizer and the model weights
        PhobertTokenizer = AutoTokenizer.from_pretrained("VoVanPhuc/sup-SimCSE-VietNamese-phobert-base")
        model = AutoModel.from_pretrained("VoVanPhuc/sup-SimCSE-VietNamese-phobert-base")
        sentences = ['Your Vietnamese sentences here']
        # PhoBERT expects word-segmented input, so segment with pyvi first
        sentences = [tokenize(sentence) for sentence in sentences]
        inputs = PhobertTokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        # Inference only, so gradient tracking is disabled
        with torch.no_grad():
            embeddings = model(**inputs, output_hidden_states=True, return_dict=True).pooler_output
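Once embeddings are computed with either approach, sentence similarity is typically measured with cosine similarity. The sketch below assumes the embeddings have been converted to plain lists of floats (for example with `.tolist()`); the `rank_by_similarity` helper and the toy vectors are hypothetical and stand in for real model output.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, candidates):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for sentence embeddings (hypothetical values).
query = [0.2, 0.8, 0.1]
candidates = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.2],
    [0.0, 0.1, 0.9],
]

ranking = rank_by_similarity(query, candidates)
best_index, best_score = ranking[0]
print(best_index, round(best_score, 3))
```

With real embeddings, `query` would be one row of `embeddings` and `candidates` the remaining rows; the ranking then orders the other sentences by how close they are to the query sentence.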
Cloud GPUs
For faster inference, consider running on a cloud GPU, such as those available through Google Colab, AWS, or Azure. A Colab quick start is also available.
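When a GPU is available, the model has to be placed on it explicitly. The following is a minimal sketch of device selection; the `pick_device` helper is a hypothetical name introduced here, and it falls back to the CPU when PyTorch is missing or no GPU is visible.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed; fall back to CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")

# With Sentence-Transformers, the device can be passed at load time:
#   model = SentenceTransformer('VoVanPhuc/sup-SimCSE-VietNamese-phobert-base', device=device)
# With Transformers, move both the model and the tokenized inputs:
#   model = model.to(device)
#   inputs = {name: tensor.to(device) for name, tensor in inputs.items()}
```

The same script then runs unchanged on a laptop CPU and on a Colab, AWS, or Azure GPU instance.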
License
Use of SimCSE_Vietnamese and related resources is subject to the licenses specified by the authors and by the repositories from which the models and datasets are obtained. Always check the licensing terms of the specific model or dataset you are using.