sentence_similarity_spanish_es
Introduction
The sentence_similarity_spanish_es model, developed by HIIAMSID, is a Spanish sentence-transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space. It is designed for tasks such as clustering and semantic search.
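As a quick illustration of the clustering use case, here is a minimal sketch; the example corpus and the use of scikit-learn's KMeans are illustrative assumptions, not part of the model card:

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer('hiiamsid/sentence_similarity_spanish_es')

# Illustrative Spanish corpus covering two topics (pets, finance)
corpus = [
    'El gato duerme en el sofá',
    'Mi perro juega en el parque',
    'La bolsa de valores cayó hoy',
    'Los mercados financieros están volátiles',
]

# Encode to 768-dimensional vectors, then cluster
embeddings = model.encode(corpus)
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(embeddings)
for label, sentence in zip(kmeans.labels_, corpus):
    print(label, sentence)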
Architecture
The model is a Transformer with BertModel as the underlying architecture, followed by a pooling layer that applies mean pooling over token embeddings. It is configured with a maximum sequence length of 512 and does not lower-case its input.
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
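This configuration can be confirmed at runtime with a small sketch; the attribute names follow the sentence-transformers API, where a SentenceTransformer is a sequence of modules:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('hiiamsid/sentence_similarity_spanish_es')

# Transformer module at index 0, Pooling module at index 1
print(model.max_seq_length)  # expected: 512
print(model[0])              # Transformer wrapping BertModel
print(model[1])              # Pooling with pooling_mode_mean_tokens=True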
Training
The model was trained on the stsb_multi_mt dataset with dccuchile/bert-base-spanish-wwm-cased as the base model. Training used CosineSimilarityLoss with a batch size of 16 over 4 epochs, an AdamW optimizer with a learning rate of 2e-05, and a WarmupLinear scheduler.
{
  "epochs": 4,
  "evaluation_steps": 1000,
  "optimizer_params": {
    "lr": 2e-05
  },
  "warmup_steps": 144,
  "weight_decay": 0.01
}
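A sketch of how such a fine-tune could be reproduced with the sentence-transformers fit API; the Spanish ('es') config of stsb_multi_mt and the division of similarity_score by 5.0 (its 0-5 scale) are assumptions based on that dataset's conventions, not details stated in the model card:

from torch.utils.data import DataLoader
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Rebuild the architecture shown above: BERT base + mean pooling
word_embedding_model = models.Transformer('dccuchile/bert-base-spanish-wwm-cased', max_seq_length=512)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode='mean')
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Assumed Spanish split; pairs are scored 0-5, normalized to 0-1 for cosine loss
train_data = load_dataset('stsb_multi_mt', 'es', split='train')
train_examples = [
    InputExample(texts=[row['sentence1'], row['sentence2']],
                 label=row['similarity_score'] / 5.0)
    for row in train_data
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

# Hyperparameters mirror the config above; AdamW and WarmupLinear are fit() defaults
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=4,
    warmup_steps=144,
    optimizer_params={'lr': 2e-05},
    weight_decay=0.01,
)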
Guide: Running Locally
Using Sentence-Transformers
- Install the sentence-transformers library:
pip install -U sentence-transformers
- Use the model for generating embeddings:
from sentence_transformers import SentenceTransformer

sentences = ['Mi nombre es Siddhartha', 'Mis amigos me llamaron por mi nombre Siddhartha']

model = SentenceTransformer('hiiamsid/sentence_similarity_spanish_es')
embeddings = model.encode(sentences)
print(embeddings)
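To turn those embeddings into a similarity score, a short follow-up sketch, assuming a recent sentence-transformers version where util.cos_sim is available:

from sentence_transformers import util

# Cosine similarity between the two sentence embeddings (closer to 1 = more similar)
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(similarity)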
Using Hugging Face Transformers
- Install the necessary libraries:
pip install transformers torch
- Use the model with pooling for sentence embeddings:
from transformers import AutoTokenizer, AutoModel
import torch

# Mean pooling: average token embeddings, weighted by the attention mask
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ['Mi nombre es Siddhartha', 'Mis amigos me llamaron por mi nombre Siddhartha']

tokenizer = AutoTokenizer.from_pretrained('hiiamsid/sentence_similarity_spanish_es')
model = AutoModel.from_pretrained('hiiamsid/sentence_similarity_spanish_es')

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
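Since this path avoids the sentence-transformers dependency entirely, the similarity score can be computed with plain PyTorch; a brief sketch continuing from the embeddings above:

import torch.nn.functional as F

# Cosine similarity between the two pooled 768-dimensional embeddings
similarity = F.cosine_similarity(sentence_embeddings[0], sentence_embeddings[1], dim=0)
print(similarity.item())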
Cloud GPUs
For faster inference and training, consider cloud GPUs from providers such as AWS, Google Cloud, or Azure.
License
This model is licensed under the Apache-2.0 License, which allows for both personal and commercial use.