sonoisa/sentence-bert-base-ja-mean-tokens-v2
Introduction
This documentation provides an overview of "sentence-bert-base-ja-mean-tokens-v2," a Japanese Sentence-BERT model designed for sentence-similarity tasks. Compared with its predecessor, this version was trained with MultipleNegativesRankingLoss and achieves higher accuracy on the author's private evaluation datasets.
Architecture
The model is based on the BERT architecture and uses the "cl-tohoku/bert-base-japanese-whole-word-masking" pre-trained model. It requires additional libraries, fugashi and ipadic, for Japanese tokenization and text processing.
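As a quick sanity check of this tokenization stack, the short sketch below loads the model's tokenizer and segments a Japanese sentence. It assumes transformers, fugashi, and ipadic are already installed; the example sentence is only illustrative.

```python
# Minimal tokenization check (assumes transformers, fugashi, and ipadic are installed).
from transformers import BertJapaneseTokenizer

tokenizer = BertJapaneseTokenizer.from_pretrained("sonoisa/sentence-bert-base-ja-mean-tokens-v2")

# MeCab (via fugashi/ipadic) performs word segmentation, followed by WordPiece subword splitting.
print(tokenizer.tokenize("暴走したAI"))
```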
Training
This Sentence-BERT model was trained with the MultipleNegativesRankingLoss objective, which learns sentence representations from positive sentence pairs by treating the other pairs in a batch as negatives. Training focused on sentence-similarity tasks, resulting in improved performance over the previous version.
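The original training data and script are not published, so the following is only a hedged sketch of what MultipleNegativesRankingLoss training looks like with the sentence-transformers library; the sentence pairs and hyperparameters are placeholders, and mean pooling is stacked on the same base BERT named in the Architecture section.

```python
# Illustrative sketch of MultipleNegativesRankingLoss training with sentence-transformers.
# The sentence pairs and hyperparameters are placeholders, not the model's actual training setup.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

word_embedding = models.Transformer("cl-tohoku/bert-base-japanese-whole-word-masking")
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())  # mean pooling by default
model = SentenceTransformer(modules=[word_embedding, pooling])

# Each InputExample is an (anchor, positive) pair; other pairs in the batch act as in-batch negatives.
train_examples = [
    InputExample(texts=["暴走したAI", "暴走した人工知能"]),
    InputExample(texts=["走る車", "疾走する自動車"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```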
Guide: Running Locally
To run this model locally, follow these steps:
- Install Required Libraries:

```bash
pip install transformers torch fugashi ipadic
```
- Load the Model (a cosine-similarity scoring example follows after these steps):
```python
from transformers import BertJapaneseTokenizer, BertModel
import torch


class SentenceBertJapanese:
    def __init__(self, model_name_or_path, device=None):
        self.tokenizer = BertJapaneseTokenizer.from_pretrained(model_name_or_path)
        self.model = BertModel.from_pretrained(model_name_or_path)
        self.model.eval()

        if device is None:
            device = "cuda" if torch.cuda.is_available() else "cpu"
        self.device = torch.device(device)
        self.model.to(device)

    def _mean_pooling(self, model_output, attention_mask):
        token_embeddings = model_output[0]
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    @torch.no_grad()
    def encode(self, sentences, batch_size=8):
        all_embeddings = []
        iterator = range(0, len(sentences), batch_size)
        for batch_idx in iterator:
            batch = sentences[batch_idx:batch_idx + batch_size]

            encoded_input = self.tokenizer.batch_encode_plus(
                batch, padding="longest", truncation=True, return_tensors="pt"
            ).to(self.device)
            model_output = self.model(**encoded_input)
            sentence_embeddings = self._mean_pooling(model_output, encoded_input["attention_mask"]).to("cpu")

            all_embeddings.extend(sentence_embeddings)

        return torch.stack(all_embeddings)


MODEL_NAME = "sonoisa/sentence-bert-base-ja-mean-tokens-v2"
model = SentenceBertJapanese(MODEL_NAME)

sentences = ["暴走したAI", "暴走した人工知能"]
sentence_embeddings = model.encode(sentences, batch_size=8)

print("Sentence embeddings:", sentence_embeddings)
```
- Run on Cloud GPUs: For better performance, consider using cloud services that offer GPUs, such as AWS, Google Cloud, or Azure.
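Once embeddings are computed, sentence similarity is typically scored with cosine similarity between embedding vectors. The minimal example below reuses the sentence_embeddings tensor produced in the "Load the Model" step above; it is illustrative usage, not part of the model's published API.

```python
# Cosine similarity between the two sentence embeddings computed above (illustrative usage).
import torch.nn.functional as F

score = F.cosine_similarity(sentence_embeddings[0], sentence_embeddings[1], dim=0)
print(f"cosine similarity: {score.item():.4f}")  # paraphrases should score close to 1.0
```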
License
The model and its associated files are licensed under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).