sentence-BERTino
Introduction
Sentence-BERTino is a sentence-transformers model designed to map sentences and paragraphs into a 768-dimensional dense vector space. It is useful for tasks such as clustering and semantic search. The model is trained on datasets comprising question/context pairs and tags/news-article pairs.
Architecture
The architecture of Sentence-BERTino is based on DistilBERT. It pairs a transformer with a maximum sequence length of 512 tokens with a mean-pooling layer: token embeddings are averaged, weighted by the attention mask, to produce a single sentence embedding.
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: DistilBertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
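To verify this configuration locally, you can load the published checkpoint and inspect its modules. This is a minimal snippet, assuming the sentence-transformers package is installed (see the guide below); its printed output should match the listing above:

from sentence_transformers import SentenceTransformer

# Load the published checkpoint and inspect the module stack
model = SentenceTransformer('efederici/sentence-BERTino')
print(model)                                      # Transformer (DistilBertModel) + Pooling
print(model.max_seq_length)                       # 512
print(model.get_sentence_embedding_dimension())   # 768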
Training
The model was trained on a dataset composed of question/context pairs from squad-it and tags/news-article pairs obtained through scraping. This training helps it produce meaningful sentence embeddings for tasks involving semantic similarity and text clustering.
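The exact training objective and hyperparameters are not documented here. As an illustration only, the sketch below shows one common recipe for training on question/context pairs with sentence-transformers, assuming a MultipleNegativesRankingLoss objective (in-batch negatives) and a placeholder multilingual DistilBERT base checkpoint:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, InputExample, losses

# Rebuild the architecture shown above: DistilBERT encoder + mean pooling.
# 'distilbert-base-multilingual-cased' is a placeholder; the actual base
# checkpoint used for sentence-BERTino is not stated in this card.
word_embedding_model = models.Transformer('distilbert-base-multilingual-cased', max_seq_length=512)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode='mean')
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Each InputExample holds one question/context (or tag/article) pair; under
# MultipleNegativesRankingLoss, the other pairs in a batch act as negatives.
train_examples = [
    InputExample(texts=["Chi ha scritto la Divina Commedia?",
                        "Dante Alighieri scrisse la Divina Commedia tra il 1306 e il 1321."]),
    # ... one example per pair in the dataset
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

# Hyperparameters here are illustrative, not those used for the released model.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)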
Guide: Running Locally
Setup
- Install Sentence-Transformers

Install the necessary package by running:

pip install -U sentence-transformers
- Using Sentence-BERTino with Sentence-Transformers

Load and use the model with the following code:

from sentence_transformers import SentenceTransformer

sentences = ["Questo è un esempio di frase", "Questo è un ulteriore esempio"]

model = SentenceTransformer('efederici/sentence-BERTino')
embeddings = model.encode(sentences)
print(embeddings)
- Using Hugging Face Transformers

Alternatively, use the model with Hugging Face Transformers. Here, mean pooling must be applied manually to the token embeddings:

from transformers import AutoTokenizer, AutoModel
import torch

# Mean pooling: average token embeddings, weighting each token by its attention mask
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output holds the token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ["Questo è un esempio di frase", "Questo è un ulteriore esempio"]

tokenizer = AutoTokenizer.from_pretrained('efederici/sentence-BERTino')
model = AutoModel.from_pretrained('efederici/sentence-BERTino')

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
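With embeddings in hand, the semantic search mentioned in the introduction reduces to a cosine-similarity lookup. Below is a minimal sketch using the util helpers from sentence-transformers; the corpus and query are made-up Italian examples, not drawn from the model card:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('efederici/sentence-BERTino')

# Toy corpus and query, purely illustrative
corpus = [
    "Il Colosseo si trova a Roma",
    "La pizza è nata a Napoli",
    "Il Po è il fiume più lungo d'Italia",
]
query = "Dove è stata inventata la pizza?"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank corpus sentences by cosine similarity to the query
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
best = int(scores.argmax())
print(corpus[best], float(scores[best]))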
Cloud GPUs
For faster inference and training, consider using cloud GPU services like AWS, Google Cloud Platform, or Azure.
License
This model is licensed under the Apache 2.0 License, allowing wide use and distribution.