sentence-BERTino

efederici

Introduction

Sentence-BERTino is a sentence-transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space, making it useful for tasks such as clustering and semantic search. It was trained on Italian question/context pairs and tag/news-article pairs.
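
For a concrete picture of semantic search with these embeddings, here is a minimal sketch that ranks a small corpus against a query by cosine similarity, using the util.cos_sim helper from sentence-transformers; the corpus and query strings are invented for illustration:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('efederici/sentence-BERTino')

# Embed a small corpus and a query, then rank corpus sentences by cosine similarity.
corpus = [
    "Il gatto dorme sul divano",
    "La borsa di Milano chiude in rialzo",
    "Un felino riposa sul sofà",
]
query = "Un animale domestico che dorme"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
for sentence, score in sorted(zip(corpus, scores.tolist()), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {sentence}")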

Architecture

The architecture of Sentence-BERTino is based on DistilBERT. It combines a transformer module with a maximum sequence length of 512 and a mean-pooling layer that builds sentence embeddings by averaging token embeddings, weighted by the attention mask so that padding tokens are ignored.

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
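
You can reproduce this listing by loading the model and printing it:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('efederici/sentence-BERTino')
print(model)  # prints the Transformer + Pooling module stack shown above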

Training

The model was trained on question/context pairs from squad-it and tag/news-article pairs obtained through scraping. Training on these pairs tunes the model to produce sentence embeddings suited to semantic similarity and text clustering.
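
The card does not state the training objective. For pair data of this shape, a common sentence-transformers recipe is MultipleNegativesRankingLoss, which treats each pair as a positive and the other texts in the batch as negatives. The sketch below is a hypothetical illustration of that recipe, not the author's actual training code; the base checkpoint and example pairs are invented:

from sentence_transformers import SentenceTransformer, InputExample, losses, models
from torch.utils.data import DataLoader

# Hypothetical: assemble a DistilBERT encoder with mean pooling, mirroring the
# architecture above (the base checkpoint name here is an assumption).
word_embedding_model = models.Transformer('distilbert-base-multilingual-cased', max_seq_length=512)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode='mean')
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Invented examples in the question/context and tag/news-article pair format.
train_examples = [
    InputExample(texts=["Chi ha dipinto la Gioconda?", "La Gioconda fu dipinta da Leonardo da Vinci."]),
    InputExample(texts=["sport", "La squadra ha vinto il campionato dopo una lunga stagione."]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
# Each pair is a positive; the other texts in the batch act as negatives.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)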

Guide: Running Locally

Setup

  1. Install Sentence-Transformers
    Install the necessary package by running:

    pip install -U sentence-transformers
    
  2. Using Sentence-BERTino with Sentence-Transformers
    Load and use the model with the following code:

    from sentence_transformers import SentenceTransformer
    
    sentences = ["Questo è un esempio di frase", "Questo è un ulteriore esempio"]
    model = SentenceTransformer('efederici/sentence-BERTino')
    embeddings = model.encode(sentences)
    print(embeddings)
    
  3. Using Hugging Face Transformers
    Alternatively, use the model with Hugging Face Transformers, applying mean pooling to the token embeddings yourself; a quick similarity check follows the code:

    from transformers import AutoTokenizer, AutoModel
    import torch
    
    # Mean pooling: average token embeddings, using the attention mask
    # so that padding tokens do not contribute to the sentence embedding.
    def mean_pooling(model_output, attention_mask):
        token_embeddings = model_output[0]  # first element holds the token embeddings
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    
    sentences = ["Questo è un esempio di frase", "Questo è un ulteriore esempio"]
    tokenizer = AutoTokenizer.from_pretrained('efederici/sentence-BERTino')
    model = AutoModel.from_pretrained('efederici/sentence-BERTino')
    
    # Tokenize, run the model without gradients, then pool.
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**encoded_input)
    
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    print("Sentence embeddings:")
    print(sentence_embeddings)
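
    As a quick sanity check, compare the two sentence embeddings by cosine similarity (this reuses the sentence_embeddings tensor from the code above):

    import torch.nn.functional as F

    # Cosine similarity between the two example sentences' embeddings.
    similarity = F.cosine_similarity(sentence_embeddings[0], sentence_embeddings[1], dim=0)
    print(f"cosine similarity: {similarity.item():.3f}")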
    

Cloud GPUs

For faster inference and training, consider using cloud GPU services like AWS, Google Cloud Platform, or Azure.

License

This model is released under the Apache 2.0 license, which permits commercial use, modification, and redistribution.
