Tooka-SBERT
PartAI
Introduction
Tooka-SBERT is a Sentence Transformer model optimized for Persian language tasks such as semantic textual similarity, semantic search, paraphrase mining, text classification, and clustering. It is based on the TookaBERT-Large model and trained to map sentences and paragraphs into a dense vector space.
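As an illustration of the semantic-search use case, the sketch below encodes a small corpus and ranks it against a query using the library's built-in utility. It reuses the example sentences from the inference guide further down, so only the top_k value is an arbitrary choice.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("PartAI/Tooka-SBERT")

# Two candidate documents and one query, taken from the inference example below.
corpus = [
    'درناها با قامتی بلند و بالهای پهن، از زیباترین پرندگان مهاجر به شمار میروند.',
    'درناها پرندگانی کوچک با پاهای کوتاه هستند که مهاجرت نمیکنند.',
]
query = 'درنا از پرندگان مهاجر با پاهای بلند و گردن دراز است.'

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank corpus entries by cosine similarity to the query.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(hit['score'], corpus[hit['corpus_id']])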
Architecture
- Model Type: Sentence Transformer
- Base Model: TookaBERT-Large
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 1024 dimensions
- Similarity Function: Cosine Similarity
- Language: Persian
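These figures can be verified programmatically once the model is loaded; a quick check, with expected outputs per the list above:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("PartAI/Tooka-SBERT")
print(model.max_seq_length)                      # 512
print(model.get_sentence_embedding_dimension())  # 1024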
Training
The Tooka-SBERT model uses a Siamese network structure to compute sentence embeddings and is trained with the CachedMultipleNegativesRankingLoss function. It is designed to efficiently capture semantic relationships in text, particularly Persian.
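The exact training data and hyperparameters are not stated here. As a rough illustration of how a model is trained with this loss in the sentence-transformers library, the sketch below fine-tunes the base encoder on (anchor, positive) pairs, with in-batch sentences serving as negatives; the base-model ID, the single example pair, and the mini_batch_size are assumptions for illustration, not PartAI's actual recipe.

from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

# Assumed base-model ID; the card only names "TookaBERT-Large".
model = SentenceTransformer("PartAI/TookaBERT-Large")

# Tiny illustrative dataset of (anchor, positive) pairs.
train_dataset = Dataset.from_dict({
    "anchor": ['درنا از پرندگان مهاجر با پاهای بلند و گردن دراز است.'],
    "positive": ['درناها با قامتی بلند و بالهای پهن، از زیباترین پرندگان مهاجر به شمار میروند.'],
})

# The cached variant of multiple-negatives ranking loss trades extra
# compute for memory, allowing much larger effective batch sizes.
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=32)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()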
Guide: Running Locally
To run the Tooka-SBERT model locally, follow these steps:
- Install Sentence Transformers:
pip install -U sentence-transformers
- Load the Model and Run Inference:
from sentence_transformers import SentenceTransformer

# Load the model from Hugging Face Hub
model = SentenceTransformer("PartAI/Tooka-SBERT")

# Define sentences for encoding
sentences = [
    'درنا از پرندگان مهاجر با پاهای بلند و گردن دراز است.',
    'درناها با قامتی بلند و بالهای پهن، از زیباترین پرندگان مهاجر به شمار میروند.',
    'درناها پرندگانی کوچک با پاهای کوتاه هستند که مهاجرت نمیکنند.',
]

# Encode sentences
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 1024)

# Compute similarity scores
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # torch.Size([3, 3])
- Suggestions for Cloud GPUs: Use cloud platforms such as AWS, Google Cloud, or Azure that provide GPU resources to speed up model inference (see the sketch below).
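On a GPU-equipped machine, the model can be pinned to the device explicitly. Note that sentence-transformers auto-detects CUDA when available, so the device argument here only makes the choice explicit:

import torch
from sentence_transformers import SentenceTransformer

# Fall back to CPU when no GPU is present.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("PartAI/Tooka-SBERT", device=device)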
License
The Tooka-SBERT model is released under the Apache 2.0 License. This permits use, distribution, and modification under the terms specified in the license.