stsb-xlm-r-multilingual-ro

BlackKakapo

Introduction

stsb-xlm-r-multilingual-ro is a sentence-transformers model that maps Romanian sentences and paragraphs to a 768-dimensional dense vector space, which makes it suitable for tasks such as clustering and semantic search. It is a version of the stsb-xlm-r-multilingual model fine-tuned for the Romanian language.
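As a rough illustration of how such embeddings support semantic search, the sketch below ranks candidate vectors by cosine similarity to a query vector. It uses toy 3-dimensional vectors in place of real 768-dimensional model output, and the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical, not part of the model or of sentence-transformers.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, corpus):
    """Return (index, score) pairs for the corpus vectors, best match first."""
    scores = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(corpus)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for real sentence embeddings (assumption for the demo).
query = [1.0, 0.0, 0.0]
corpus = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]
ranking = rank_by_similarity(query, corpus)
print([index for index, _ in ranking])
```

In practice the query and corpus vectors would come from the model's `encode` call, and a vectorized library would replace the pure-Python loops for large corpora.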

Architecture

The model is a SentenceTransformer pipeline built on an XLMRobertaModel transformer. Inputs are truncated to a maximum sequence length of 128 tokens, and mean pooling over the token embeddings produces the sentence embedding. The word embedding dimension is 768.
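Mean pooling can be sketched in a few lines of plain Python: average the token vectors whose attention-mask entry is 1 and ignore padding positions. This is a minimal illustration on toy 2-dimensional vectors, not the library's implementation; the function name `mean_pool` is hypothetical.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1; padding is ignored."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    count = max(count, 1)  # guard against division by zero for an all-padding input
    return [total / count for total in summed]

# Two real tokens followed by one padding token (toy values).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)
```

Because the padding vector is masked out, its large values do not affect the result; only the first two tokens contribute to the average.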

Training

The model was trained on the STS-ro dataset, a Romanian text dataset whose scores range from 0 to 5. These scores are normalized to the range 0 to 1 for compatibility with the EmbeddingSimilarityEvaluator. Training used a DataLoader with a batch size of 32 and the CosineSimilarityLoss function, and ran for 10 epochs with warmup steps and a learning rate of 2e-05.
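The score normalization and the training objective can be sketched as follows. The example assumes normalization is a plain division by 5 and takes the loss to be the mean squared error between the cosine similarity of each embedding pair and its normalized score, which is how CosineSimilarityLoss in sentence-transformers is defined. The function names and toy vectors are hypothetical.

```python
import math

def normalize_score(score, max_score=5.0):
    """Map a 0-5 similarity score to the 0-1 range (assumed: simple division)."""
    return score / max_score

def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_similarity_loss(pairs, labels):
    """Mean squared error between pairwise cosine similarity and the labels."""
    errors = [(cosine_similarity(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)

# Toy batch: an identical pair scored 5 and an orthogonal pair scored 0.
raw_scores = [5.0, 0.0]
labels = [normalize_score(score) for score in raw_scores]
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
loss = cosine_similarity_loss(pairs, labels)
print(labels, loss)
```

In this toy batch the embeddings already agree with the labels, so the loss is zero; during training the optimizer adjusts the model so that real sentence pairs move toward that state.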

Guide: Running Locally

To use this model locally, follow these steps:

  1. Install sentence-transformers:

    pip install -U sentence-transformers
    
  2. Use the model:

    from sentence_transformers import SentenceTransformer
    
    sentences = ["This is an example sentence", "Each sentence is converted"]
    model = SentenceTransformer('BlackKakapo/stsb-xlm-r-multilingual-ro')
    embeddings = model.encode(sentences)
    print(embeddings)
    
  3. Alternatively, use Hugging Face Transformers:

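    from transformers import AutoTokenizer, AutoModel
    import torch

    # Mean pooling: average the token embeddings, using the attention mask to ignore padding
    def mean_pooling(model_output, attention_mask):
        token_embeddings = model_output[0]  # first element holds all token embeddings
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    sentences = ['This is an example sentence', 'Each sentence is converted']
    tokenizer = AutoTokenizer.from_pretrained('BlackKakapo/stsb-xlm-r-multilingual-ro')
    model = AutoModel.from_pretrained('BlackKakapo/stsb-xlm-r-multilingual-ro')

    # Tokenize, truncating to the 128-token limit given in the Architecture section
    encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Pool the token embeddings into one vector per sentence
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    print(sentence_embeddings)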
    
  4. Consider using cloud GPUs: For faster performance, especially for large datasets or complex tasks, consider using cloud services that provide GPU support.

License

The licensing details for this model have not been specified in the provided documentation. Please refer to the original source or repository for licensing information.
