indo-sentence-bert-base

Introduction
The indo-sentence-bert-base model is a sentence-transformers model that maps Indonesian sentences and paragraphs to a 768-dimensional dense vector space. The resulting embeddings are suited to tasks such as clustering and semantic search. The model builds on the Sentence-BERT architecture to produce efficient sentence embeddings.
Architecture
The model utilizes the SentenceTransformer architecture, which combines a BERT model with pooling layers to generate sentence embeddings. The architecture includes:
- A Transformer layer wrapping a BERT model (`BertModel`), configured with a maximum sequence length of 512 and no lowercasing.
- A Pooling layer that combines token embeddings using mean pooling, as sketched below.
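This two-module stack can be expressed with the sentence-transformers `models` API. The following is a minimal sketch, assuming the published checkpoint is loaded as the underlying BERT weights:

```python
from sentence_transformers import SentenceTransformer, models

# Transformer layer: BERT encoder with a 512-token maximum sequence length.
word_embedding_model = models.Transformer('firqaaa/indo-sentence-bert-base', max_seq_length=512)

# Pooling layer: mean-pool token embeddings into a single sentence vector.
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),  # 768 for BERT-base
    pooling_mode='mean',
)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```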
Training
The model was trained with the following configuration (a code sketch follows the list):
- DataLoader: a `NoDuplicatesDataLoader` with a batch size of 16.
- Loss function: `MultipleNegativesRankingLoss` with a scale of 20.0 and the cosine similarity function.
- Training parameters:
  - Epochs: 5
  - Optimizer: AdamW with a learning rate of 2e-05
  - Warmup steps: 9930
  - Weight decay: 0.01
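Put together, this corresponds roughly to the classic sentence-transformers `fit` API. The sketch below is hedged: the card does not describe the training corpus or the base checkpoint, so `train_examples` and the model path are placeholders.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.datasets import NoDuplicatesDataLoader

# Placeholder: the card does not name the base checkpoint used for training.
model = SentenceTransformer('path/to/indonesian-bert-base')

# Placeholder pairs; MultipleNegativesRankingLoss expects positive sentence pairs.
train_examples = [
    InputExample(texts=["Ibukota Perancis adalah Paris", "Paris adalah ibukota Perancis"]),
    # ...
]

# NoDuplicatesDataLoader keeps duplicate sentences out of a batch, which
# would otherwise act as false negatives for this loss.
train_dataloader = NoDuplicatesDataLoader(train_examples, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)  # cosine similarity by default

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=5,
    warmup_steps=9930,
    optimizer_params={'lr': 2e-5},
    weight_decay=0.01,
)
```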
Guide: Running Locally
To run the indo-sentence-bert-base model locally, follow these steps:
1. Install the required library:

   ```bash
   pip install -U sentence-transformers
   ```
2. Using sentence-transformers:

   ```python
   from sentence_transformers import SentenceTransformer

   # Indonesian example sentences.
   sentences = [
       "Ibukota Perancis adalah Paris",
       "Menara Eiffel terletak di Paris, Perancis",
       "Pizza adalah makanan khas Italia",
       "Saya kuliah di Carnegie Mellon University",
   ]

   model = SentenceTransformer('firqaaa/indo-sentence-bert-base')
   embeddings = model.encode(sentences)
   print(embeddings)
   ```
3. Using Hugging Face Transformers:

   ```python
   from transformers import AutoTokenizer, AutoModel
   import torch

   # Mean pooling: average token embeddings, weighted by the attention mask
   # so that padding tokens are ignored.
   def mean_pooling(model_output, attention_mask):
       token_embeddings = model_output[0]  # first element holds all token embeddings
       input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
       return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

   # Indonesian example sentences.
   sentences = [
       "Ibukota Perancis adalah Paris",
       "Menara Eiffel terletak di Paris, Perancis",
       "Pizza adalah makanan khas Italia",
       "Saya kuliah di Carnegie Mellon University",
   ]

   tokenizer = AutoTokenizer.from_pretrained('firqaaa/indo-sentence-bert-base')
   model = AutoModel.from_pretrained('firqaaa/indo-sentence-bert-base')

   encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
   with torch.no_grad():
       model_output = model(**encoded_input)

   sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
   print("Sentence embeddings:")
   print(sentence_embeddings)
   ```
4. Cloud GPU suggestion: to speed up encoding, consider cloud services such as AWS, GCP, or Azure that offer GPU instances.
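Since the card highlights semantic search as a primary use case, here is a minimal sketch of scoring a query against a small corpus with the embeddings; the query and corpus strings are illustrative, and `util.cos_sim` is the library's cosine-similarity helper:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('firqaaa/indo-sentence-bert-base')

# Illustrative Indonesian corpus and query (not from the model card).
corpus = [
    "Menara Eiffel terletak di Paris, Perancis",
    "Pizza adalah makanan khas Italia",
]
query = "Di mana Menara Eiffel berada?"  # "Where is the Eiffel Tower?"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and every corpus sentence.
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
best = int(scores.argmax())
print(corpus[best], float(scores[best]))
```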
License
The indo-sentence-bert-base model is released under the Apache-2.0 license, which permits broad use with minimal restrictions.