Sarashina-Embedding-v1-1B

SB Intuitions

Introduction

"Sarashina-Embedding-v1-1B" is a Japanese text embedding model built upon the 1.2B-parameter Japanese LLM "Sarashina2.1-1B". It is designed for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and other applications. The model employs multi-stage contrastive learning and achieves state-of-the-art performance on the JMTEB benchmark.

Architecture

The model is a Sentence Transformer with the following architecture:

  • Base Model: Sarashina2.1-1B
  • Maximum Sequence Length: 8,192 tokens
  • Output Dimensions: 1,792
  • Similarity Function: Cosine Similarity
  • Language: Japanese

Detailed architecture:

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: LlamaModel 
  (1): Pooling({'word_embedding_dimension': 1792, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': False})
)
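
As the configuration shows, the Pooling layer uses last-token pooling (pooling_mode_lasttoken: True): the sentence embedding is the hidden state of the last real token in each sequence. A minimal PyTorch sketch of that operation, assuming right-padded batches (the actual implementation lives in sentence_transformers.models.Pooling):

import torch

def last_token_pool(hidden_states: torch.Tensor,
                    attention_mask: torch.Tensor) -> torch.Tensor:
    # hidden_states: (batch, seq_len, 1792) token embeddings from the LlamaModel
    # attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    last_idx = attention_mask.sum(dim=1) - 1  # position of each sequence's last real token
    batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
    return hidden_states[batch_idx, last_idx]  # (batch, 1792) sentence embeddings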

Training

The model was developed using a two-stage training process:

Stage 1: Weakly-Supervised Learning

  • Contrastive training on weakly-supervised data from web-crawled and open datasets; a loss sketch follows after this list.
  • Total data: 126,744,763 entries.
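
The exact Stage 1 objective is not spelled out above; weakly-supervised embedding training of this kind commonly uses an in-batch-negative contrastive (InfoNCE) loss over paired texts. A minimal PyTorch sketch under that assumption:

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(anchors: torch.Tensor, positives: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    # anchors, positives: (batch, dim) embeddings of weakly paired texts;
    # each anchor's positive is its own pair, and every other positive in
    # the batch serves as a negative.
    anchors = F.normalize(anchors, dim=-1)
    positives = F.normalize(positives, dim=-1)
    logits = anchors @ positives.T / temperature           # (batch, batch) cosine scores
    labels = torch.arange(anchors.size(0), device=anchors.device)  # diagonal = positives
    return F.cross_entropy(logits, labels)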

Stage 2: Supervised Fine-Tuning

  • Supervised fine-tuning focused on query-document similarity, using datasets such as JSNLI, NU-MNLI, and Mr. TyDi (see the sketch after this list).
  • Total data: 233,072 entries.
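
The precise Stage 2 recipe is likewise not given here; in the Sentence Transformers library, supervised fine-tuning on (anchor, positive, negative) triplets such as NLI data is typically done with MultipleNegativesRankingLoss. A hedged sketch with placeholder data (the triplet contents are illustrative, not from the actual training set):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b")

# Placeholder triplets standing in for JSNLI/NU-MNLI-style examples:
# (anchor, entailed sentence as positive, contradicted sentence as negative)
train_examples = [
    InputExample(texts=["anchor text", "entailed text", "contradicted text"]),
]
train_loader = DataLoader(train_examples, batch_size=16, shuffle=True)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_loader, train_loss)], epochs=1)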

Guide: Running Locally

  1. Install the Sentence Transformers library:

    pip install -U sentence-transformers
    
  2. Load and Run the Model:

    from sentence_transformers import SentenceTransformer

    # Download the model from the Hugging Face Hub
    model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b")

    sentences = [
        # "Sarashina Nikki is a memoir written in the mid-Heian period by the daughter of Sugawara no Takasue."
        '更級日記は、平安時代中期に菅原孝標女によって書かれた回想録です。',
        # "Sarashina is a Japanese large language model developed by SB Intuitions; 7B, 13B, 70B, and 8x70B models have been released so far."
        'Sarashinaは、SB Intuitionsが開発した日本語大規模言語モデルです。これまでに7B, 13B, 70B, 8x70Bのモデルが公開されています。',
        # "Sarashina-Embedding is a Japanese embedding model based on a Japanese language model."
        'サラシナエンベディングは日本語言語モデルをベースにした日本語埋め込みモデルです。'
    ]
    embeddings = model.encode(sentences)
    print(embeddings.shape)  # (3, 1792)
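
    To compare the sentences, score the embeddings with the model's configured similarity function (cosine similarity). A short follow-on sketch; model.similarity requires sentence-transformers v3.0 or later:

    # Pairwise cosine similarity between all three embeddings
    similarities = model.similarity(embeddings, embeddings)
    print(similarities.shape)  # torch.Size([3, 3])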
    
  3. Cloud GPUs:

    • For large models like Sarashina-Embedding-v1-1B, consider running inference on a cloud GPU service such as AWS, Google Cloud, or Azure.

License

The model is released under the Sarashina Model NonCommercial License Agreement. Commercial use requires contacting SB Intuitions through their contact page.
