stsb-m-mt-es-distilbert-base-uncased

eduardofv

Introduction

This is a test model: DistilBERT-base-uncased fine-tuned for Semantic Textual Similarity (STS) in Spanish. It uses the Spanish portion of the stsb_multi_mt dataset, which contains translations of the STS Benchmark, and is intended for evaluating and benchmarking STS models.

Architecture

The model is based on the DistilBERT-base-uncased architecture, fine-tuned to produce sentence embeddings for Semantic Textual Similarity tasks in Spanish. Fine-tuning uses a modified version of the training script from the Sentence Transformers library.
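
As an illustration of how sentence embeddings are used for STS, the sketch below encodes two Spanish sentences and scores them with cosine similarity. It is a minimal example, not the author's script. The model identifier is an assumption pieced together from the author and model name on this page, and cosine similarity is assumed as the scoring function because it is the usual choice for sentence embeddings; neither is confirmed by the text above.

```python
import math
from typing import List, Sequence

# ASSUMPTION: Hub identifier inferred from the author and model name on this
# page; adjust it if the model is published under a different id.
ASSUMED_MODEL_ID = "eduardofv/stsb-m-mt-es-distilbert-base-uncased"


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (range -1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def embed(sentences: List[str], model_id: str = ASSUMED_MODEL_ID):
    """Encode sentences into embeddings with Sentence Transformers.

    The import is done lazily so that cosine_similarity can be used
    without the library installed.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_id)
    return model.encode(sentences)  # one embedding per input sentence


if __name__ == "__main__":
    vectors = embed([
        "El gato duerme en el sofá.",
        "Un gato está durmiendo en el sillón.",
    ])
    print(cosine_similarity(vectors[0].tolist(), vectors[1].tolist()))
```

Scores close to 1 indicate sentences the model considers similar in meaning; unrelated sentences should score noticeably lower.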

Training

The model was trained on the Spanish portion of the stsb_multi_mt dataset, whose sentences are automatically translated from the English STS Benchmark. The training script, available in the repository, can also train on the other languages included in the dataset. Fine-tuning improves performance substantially: the Pearson correlation rises from 0.2980 to 0.7451, and the Spearman correlation from 0.4008 to 0.7364.
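
For readers unfamiliar with the two metrics, the following sketch shows how Pearson and Spearman correlations between predicted similarity scores and gold labels can be computed. It is a plain-Python illustration under the assumption that predictions and labels are parallel lists of numbers; it is not the evaluation code from the repository, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: linear agreement between two score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank-transformed lists."""
    return pearson(average_ranks(x), average_ranks(y))
```

Pearson rewards predictions that track the gold scores linearly, while Spearman only checks that sentence pairs are ordered the same way, which is why the two numbers can differ for the same model.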

Guide: Running Locally

  1. Clone the Repository: Download the repository containing the model and scripts.
  2. Install Dependencies: Ensure all necessary Python packages are installed.
  3. Run the Training Script: Use the included script to train the model in Spanish or another language supported by the dataset.
  4. Evaluate Performance: Use the provided evaluation script to measure the fine-tuned model's performance.
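
As a rough sketch of the data preparation behind step 3, the code below loads one language of stsb_multi_mt and rescales the gold scores for training. Several details are assumptions rather than facts from this page: the dataset identifier as accepted by the Hugging Face `datasets` library (newer versions may require a namespaced id), the field names `sentence1`, `sentence2`, and `similarity_score`, and the choice to divide by 5 so that labels fall in the range 0 to 1. The repository's own script may differ.

```python
from typing import Iterable, List, Mapping, Tuple

MAX_STS_SCORE = 5.0  # STS Benchmark similarity scores range from 0 to 5


def to_training_pairs(rows: Iterable[Mapping]) -> List[Tuple[str, str, float]]:
    """Convert dataset rows to (sentence1, sentence2, label) with label in [0, 1].

    ASSUMPTION: each row has the fields sentence1, sentence2, similarity_score.
    """
    pairs = []
    for row in rows:
        label = float(row["similarity_score"]) / MAX_STS_SCORE
        pairs.append((row["sentence1"], row["sentence2"], label))
    return pairs


def load_split(language: str = "es", split: str = "train"):
    """Load one language and split of the dataset (import done lazily).

    ASSUMPTION: "stsb_multi_mt" resolves on the Hugging Face Hub with the
    language code as the configuration name.
    """
    from datasets import load_dataset

    return load_dataset("stsb_multi_mt", name=language, split=split)


if __name__ == "__main__":
    train_pairs = to_training_pairs(load_split("es", "train"))
    print(len(train_pairs), train_pairs[0])
```

Changing the `language` argument is the only edit needed to prepare data for another language covered by the dataset.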

Suggestion: For efficient training and evaluation, consider using cloud GPU services such as AWS, Google Cloud, or Azure.

License

The model and associated scripts are provided for educational and research purposes. Check the repository for specific licensing information related to the dataset and model usage.
