indo-sentence-bert-base

firqaaa

Introduction

The indo-sentence-bert-base model is a sentence-transformers model that maps sentences and paragraphs into a 768-dimensional dense vector space. Tailored to the Indonesian language, it is primarily intended for tasks such as clustering and semantic search. The model leverages the Sentence-BERT (SBERT) architecture to produce efficient sentence embeddings.

Architecture

The model uses the SentenceTransformer architecture, which combines a BERT encoder with a pooling layer to produce sentence embeddings. The architecture consists of:

  • A Transformer layer wrapping a BERT model (BertModel), configured with a maximum sequence length of 512 and lowercasing disabled.
  • A Pooling layer that aggregates token embeddings via mean pooling into a single 768-dimensional sentence vector.
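In sentence-transformers terms, this composition can be written out explicitly. The sketch below is illustrative only; the published checkpoint already bundles this module configuration:

    from sentence_transformers import SentenceTransformer, models

    # Transformer module wrapping the BERT checkpoint (max_seq_length=512)
    word_embedding_model = models.Transformer('firqaaa/indo-sentence-bert-base',
                                              max_seq_length=512)

    # Mean pooling over token embeddings yields one 768-dimensional sentence vector
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                                   pooling_mode_mean_tokens=True)

    model = SentenceTransformer(modules=[word_embedding_model, pooling_model])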

Training

The model was trained using the following parameters:

  • DataLoader: A NoDuplicatesDataLoader with a batch size of 16.
  • Loss Function: MultipleNegativesRankingLoss with a scale of 20.0 and cosine similarity as the similarity function.
  • Training Parameters:
    • Epochs: 5
    • Optimizer: AdamW with a learning rate of 2e-05
    • Warmup Steps: 9930
    • Weight Decay: 0.01
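
As a minimal sketch, these settings map onto the sentence-transformers fit API as shown below. The training corpus is not documented here, so the example pairs are hypothetical placeholders:

    from sentence_transformers import SentenceTransformer, InputExample, losses
    from sentence_transformers.datasets import NoDuplicatesDataLoader

    # Hypothetical sentence pairs; the actual training data is not listed in this card
    train_examples = [
        InputExample(texts=["Ibukota Perancis adalah Paris",
                            "Paris adalah ibukota Perancis"]),
        # ...
    ]

    # Loading the published checkpoint gives the sketch a runnable starting point;
    # the original run would have fine-tuned a base Indonesian BERT
    model = SentenceTransformer('firqaaa/indo-sentence-bert-base')

    train_dataloader = NoDuplicatesDataLoader(train_examples, batch_size=16)
    # Cosine similarity is the default similarity function for this loss
    train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=5,
        warmup_steps=9930,
        optimizer_params={'lr': 2e-5},
        weight_decay=0.01,
    )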

Guide: Running Locally

To use the indo-sentence-bert-base model locally, follow these steps:

  1. Install the required libraries:

    pip install -U sentence-transformers
    
  2. Using Sentence-Transformers:

    from sentence_transformers import SentenceTransformer
    
    sentences = ["Ibukota Perancis adalah Paris", 
                 "Menara Eifel terletak di Paris, Perancis", 
                 "Pizza adalah makanan khas Italia", 
                 "Saya kuliah di Carneige Mellon University"]
    
    model = SentenceTransformer('firqaaa/indo-sentence-bert-base')
    embeddings = model.encode(sentences)
    print(embeddings)
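
     Once you have embeddings, you can score sentence pairs for semantic search. A small follow-up using the util.cos_sim helper from sentence-transformers:

    from sentence_transformers import util

    # Cosine similarity between the first sentence and the rest;
    # the Paris-related sentence should score highest
    scores = util.cos_sim(embeddings[0], embeddings[1:])
    print(scores)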
    
  3. Using Hugging Face Transformers:

    from transformers import AutoTokenizer, AutoModel
    import torch
    
    def mean_pooling(model_output, attention_mask):
        # First element of model_output holds the token embeddings
        token_embeddings = model_output[0]
        # Expand the attention mask so padding tokens are excluded from the average
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        # Sum real-token embeddings and divide by the count of real tokens
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    
    sentences = ["Ibukota Perancis adalah Paris", 
                 "Menara Eifel terletak di Paris, Perancis", 
                 "Pizza adalah makanan khas Italia", 
                 "Saya kuliah di Carneige Mellon University"]
    
    tokenizer = AutoTokenizer.from_pretrained('firqaaa/indo-sentence-bert-base')
    model = AutoModel.from_pretrained('firqaaa/indo-sentence-bert-base')
    
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    
    with torch.no_grad():
        model_output = model(**encoded_input)
    
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    print("Sentence embeddings:")
    print(sentence_embeddings)
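
     To compare these embeddings, L2-normalize them and take dot products, which is equivalent to cosine similarity (a small sketch on top of the code above):

    import torch.nn.functional as F

    # Normalize so that dot products equal cosine similarities
    normalized = F.normalize(sentence_embeddings, p=2, dim=1)
    similarity = normalized @ normalized.T
    print(similarity)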
    
  4. Cloud GPU Suggestion: To speed up encoding, consider cloud services such as AWS, GCP, or Azure that offer GPU instances; a device-placement sketch follows this list.
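
A minimal device-placement sketch with sentence-transformers, assuming PyTorch with CUDA support is installed:

    import torch
    from sentence_transformers import SentenceTransformer

    # Use a CUDA device when available, otherwise fall back to CPU
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model = SentenceTransformer('firqaaa/indo-sentence-bert-base', device=device)
    embeddings = model.encode(["Ibukota Perancis adalah Paris"], batch_size=32)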

License

The indo-sentence-bert-base model is released under the Apache-2.0 license, permitting wide usage with minimal restrictions.
