Tooka-SBERT

PartAI

Introduction

Tooka-SBERT is a Sentence Transformer model optimized for Persian language tasks such as semantic textual similarity, semantic search, paraphrase mining, text classification, and clustering. It is based on the TookaBERT-Large model and trained to map sentences and paragraphs into a dense vector space.

Architecture

  • Model Type: Sentence Transformer
  • Base Model: TookaBERT-Large
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity
  • Language: Persian
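The model maps text into 1024-dimensional vectors that are compared with cosine similarity. As a minimal NumPy sketch of that comparison (random vectors stand in for real sentence embeddings):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product of L2-normalized vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for two 1024-dimensional sentence embeddings
rng = np.random.default_rng(0)
emb_a = rng.normal(size=1024)
emb_b = rng.normal(size=1024)

print(cosine_similarity(emb_a, emb_b))  # unrelated random vectors: near 0
print(cosine_similarity(emb_a, emb_a))  # identical vectors: ~1.0
```

Semantically similar sentences produce embeddings with high cosine similarity; unrelated sentences score near zero.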

Training

The Tooka-SBERT model utilizes a Siamese network structure to compute sentence embeddings and is trained using the CachedMultipleNegativesRankingLoss function. It is designed to efficiently capture semantic relationships in text, especially in the Persian language.
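CachedMultipleNegativesRankingLoss treats each (anchor, positive) pair in a batch as a true pair and uses every other positive in the batch as a negative. The sketch below illustrates the underlying (uncached) in-batch negatives objective in NumPy; the random vectors and the `scale` value are illustrative assumptions, not the model's actual training data or hyperparameters:

```python
import numpy as np

def multiple_negatives_ranking_loss(anchors: np.ndarray,
                                    positives: np.ndarray,
                                    scale: float = 20.0) -> float:
    """In-batch negatives loss: row i of `positives` is the true match for
    row i of `anchors`; all other rows act as negatives. This is a NumPy
    sketch of the idea behind (Cached)MultipleNegativesRankingLoss."""
    # L2-normalize so dot products are cosine similarities
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sims = scale * (a @ p.T)  # (batch, batch) scaled similarity matrix
    # Cross-entropy with the diagonal (true pairs) as the target class
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
batch, dim = 8, 1024
anchors = rng.normal(size=(batch, dim))
positives = anchors + 0.1 * rng.normal(size=(batch, dim))  # near-duplicates as positives

print(multiple_negatives_ranking_loss(anchors, positives))
```

Minimizing this loss pulls each anchor toward its paired positive and pushes it away from the other sentences in the batch; the "Cached" variant additionally caches embedding gradients so that very large effective batch sizes fit in memory.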

Guide: Running Locally

To run the Tooka-SBERT model locally, follow these steps:

  1. Install Sentence Transformers:

    pip install -U sentence-transformers
    
  2. Load the Model and Run Inference:

    from sentence_transformers import SentenceTransformer
    
    # Load the model from Hugging Face Hub
    model = SentenceTransformer("PartAI/Tooka-SBERT")
    
    # Define sentences for encoding (English glosses in comments)
    sentences = [
        # "The crane is a migratory bird with long legs and a long neck."
        'درنا از پرندگان مهاجر با پاهای بلند و گردن دراز است.',
        # "With their tall stature and wide wings, cranes are among the most beautiful migratory birds."
        'درناها با قامتی بلند و بال‌های پهن، از زیباترین پرندگان مهاجر به شمار می‌روند.',
        # "Cranes are small birds with short legs that do not migrate."
        'درناها پرندگانی کوچک با پاهای کوتاه هستند که مهاجرت نمی‌کنند.'
    ]
    
    # Encode sentences into 1024-dimensional embeddings
    embeddings = model.encode(sentences)
    print(embeddings.shape)  # (3, 1024)
    
    # Compute pairwise cosine similarity scores
    similarities = model.similarity(embeddings, embeddings)
    print(similarities.shape)  # (3, 3)
    
  3. Suggestions for Cloud GPUs:

    • For faster inference, use GPU instances from cloud providers such as AWS, Google Cloud, or Azure.
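Once sentences are embedded, semantic search reduces to ranking corpus vectors by cosine similarity to a query vector. A minimal NumPy sketch of that ranking step (the `semantic_search` helper and the random vectors are illustrative stand-ins, not part of the library; in practice the embeddings would come from `model.encode(...)`):

```python
import numpy as np

def semantic_search(query_emb: np.ndarray, corpus_embs: np.ndarray, top_k: int = 3):
    """Return (index, score) pairs for the top_k corpus embeddings
    closest to the query by cosine similarity (hypothetical helper)."""
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    scores = c @ q                     # cosine similarity of each corpus item vs. query
    top = np.argsort(-scores)[:top_k]  # indices of the highest-scoring items
    return [(int(i), float(scores[i])) for i in top]

# Toy stand-ins for embeddings produced by model.encode(...)
rng = np.random.default_rng(1)
corpus = rng.normal(size=(5, 1024))
query = corpus[2] + 0.05 * rng.normal(size=1024)  # a near-duplicate of corpus item 2

print(semantic_search(query, corpus, top_k=2))  # item 2 should rank first
```

For large corpora, replacing this brute-force scan with an approximate nearest-neighbor index avoids comparing the query against every corpus vector.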

License

The Tooka-SBERT model is released under the Apache 2.0 License. This permits use, distribution, and modification under the terms specified in the license.
