sentence-bert-base-ja-mean-tokens-v2

sonoisa

Introduction

This documentation provides an overview of the Japanese Sentence-BERT model "sentence-bert-base-ja-mean-tokens-v2", which is designed for sentence-similarity tasks. This version was trained with the "MultipleNegativesRankingLoss" objective and achieves higher accuracy than its predecessor on private evaluation data.

Architecture

The model is based on the BERT architecture and builds on the "cl-tohoku/bert-base-japanese-whole-word-masking" pre-trained model. It requires the additional libraries fugashi (a MeCab wrapper) and ipadic (the IPA dictionary) for Japanese tokenization and text processing.
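
A quick way to verify that this tokenization stack is installed correctly is to load the tokenizer on its own and segment a sample sentence. The snippet below is a minimal sketch; it assumes the tokenizer files published with the model on the Hugging Face Hub.

    from transformers import BertJapaneseTokenizer
    
    # MeCab-based word segmentation (via fugashi/ipadic) followed by WordPiece subwords.
    tokenizer = BertJapaneseTokenizer.from_pretrained("sonoisa/sentence-bert-base-ja-mean-tokens-v2")
    print(tokenizer.tokenize("暴走した人工知能"))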

Training

This Sentence-BERT model was trained with the "MultipleNegativesRankingLoss" objective, which learns sentence representations from positive sentence pairs by treating the other sentences in each batch as negatives. Training focused on sentence-similarity tasks and yields improved performance over the previous version.
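
The exact training data and hyperparameters are not public, but a comparable setup can be sketched with the sentence-transformers library (not needed for inference, so it is not in the install step below). In this illustrative sketch the sentence pairs are hypothetical placeholders, and mean pooling is configured to mirror the model's "mean tokens" pooling.

    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses, models
    
    # Wrap the Japanese BERT encoder with a mean-pooling layer.
    word_embedding_model = models.Transformer("cl-tohoku/bert-base-japanese-whole-word-masking")
    pooling = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode="mean")
    model = SentenceTransformer(modules=[word_embedding_model, pooling])
    
    # Hypothetical positive pairs; with MultipleNegativesRankingLoss, every other
    # sentence in the batch acts as an in-batch negative.
    train_examples = [
        InputExample(texts=["暴走したAI", "暴走した人工知能"]),
        InputExample(texts=["今日は良い天気だ", "本日は晴天です"]),
    ]
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
    train_loss = losses.MultipleNegativesRankingLoss(model)
    
    model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)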

Guide: Running Locally

To run this model locally, follow these steps:

  1. Install Required Libraries:

    pip install transformers torch fugashi ipadic
    
  2. Load the Model and Encode Sentences (a similarity example follows this list):

    from transformers import BertJapaneseTokenizer, BertModel
    import torch
    
    class SentenceBertJapanese:
        def __init__(self, model_name_or_path, device=None):
            # Load the tokenizer and BERT encoder, and put the model in evaluation mode.
            self.tokenizer = BertJapaneseTokenizer.from_pretrained(model_name_or_path)
            self.model = BertModel.from_pretrained(model_name_or_path)
            self.model.eval()
    
            # Use a GPU when available unless a device is specified explicitly.
            if device is None:
                device = "cuda" if torch.cuda.is_available() else "cpu"
            self.device = torch.device(device)
            self.model.to(self.device)
    
        # Mean pooling: average the token embeddings, ignoring padding via the attention mask.
        def _mean_pooling(self, model_output, attention_mask):
            token_embeddings = model_output[0]
            input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
            return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    
        @torch.no_grad()
        def encode(self, sentences, batch_size=8):
            # Encode sentences in batches and return a (num_sentences, hidden_size) tensor.
            all_embeddings = []
            iterator = range(0, len(sentences), batch_size)
            for batch_idx in iterator:
                batch = sentences[batch_idx:batch_idx + batch_size]
    
                # Tokenize the batch, padding to the longest sentence, and run it through BERT.
                encoded_input = self.tokenizer.batch_encode_plus(batch, padding="longest", 
                                                truncation=True, return_tensors="pt").to(self.device)
                model_output = self.model(**encoded_input)
                sentence_embeddings = self._mean_pooling(model_output, encoded_input["attention_mask"]).to('cpu')
    
                all_embeddings.extend(sentence_embeddings)
    
            return torch.stack(all_embeddings)
    
    MODEL_NAME = "sonoisa/sentence-bert-base-ja-mean-tokens-v2"
    model = SentenceBertJapanese(MODEL_NAME)
    
    sentences = ["暴走したAI", "暴走した人工知能"]
    sentence_embeddings = model.encode(sentences, batch_size=8)
    
    print("Sentence embeddings:", sentence_embeddings)
    
  3. Run on Cloud GPUs: For better performance, consider using cloud services that offer GPUs, such as AWS, Google Cloud, or Azure.
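
With the embeddings from step 2, sentence similarity is usually scored with cosine similarity. The follow-up below is a minimal sketch that reuses the model, sentences, and sentence_embeddings variables from step 2; the query sentence is an arbitrary example.

    import torch.nn.functional as F
    
    # Embed a new query and compare it against the sentences encoded in step 2.
    query_embedding = model.encode(["AIが制御不能になった"])
    scores = F.cosine_similarity(query_embedding, sentence_embeddings)
    for sentence, score in zip(sentences, scores):
        print(f"{sentence}: {score.item():.4f}")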

License

The model and its associated files are licensed under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.
