eduardofv/stsb-m-mt-es-distilbert-base-uncased
Introduction
This model is a test implementation of DistilBERT-base-uncased fine-tuned for Semantic Textual Similarity (STS) in Spanish. It uses the stsb_multi_mt dataset, which contains Spanish machine translations of the STS Benchmark, to train, evaluate, and benchmark STS models.
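The stsb_multi_mt dataset is available on the Hugging Face Hub. As a brief, hedged illustration (assuming the `datasets` library and the field names published for stsb_multi_mt), the Spanish split can be inspected like this:

```python
from datasets import load_dataset

# Load the Spanish ("es") configuration of stsb_multi_mt; splits are train/dev/test.
train = load_dataset("stsb_multi_mt", name="es", split="train")

pair = train[0]
print(pair["sentence1"])
print(pair["sentence2"])
print(pair["similarity_score"])  # gold similarity on a 0-5 scale
```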
Architecture
The model is based on the DistilBERT-base-uncased architecture and is fine-tuned to produce sentence embeddings for Semantic Textual Similarity tasks in Spanish. Fine-tuning uses a modified version of the STS training script from the Sentence Transformers library.
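A minimal usage sketch with the Sentence Transformers library is shown below; the model ID is an assumption based on this repository's name and may need adjusting:

```python
from sentence_transformers import SentenceTransformer, util

# Assumed Hub ID for this repository; replace with the actual model name if it differs.
model = SentenceTransformer("eduardofv/stsb-m-mt-es-distilbert-base-uncased")

sentences = [
    "El gato duerme en el sofá.",          # "The cat sleeps on the sofa."
    "Un gato está durmiendo en el sofá.",  # "A cat is sleeping on the sofa."
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the two sentence embeddings (higher = more similar).
print(util.cos_sim(embeddings[0], embeddings[1]).item())
```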
Training
Training was executed using the stsb_multi_mt dataset, which includes Spanish sentences automatically translated from the English STS Benchmark. The training script, available in the repository, allows for training on other languages included in the dataset. Evaluations show that fine-tuning improves the model's performance significantly, with Pearson and Spearman correlations increasing from 0.2980 and 0.4008 to 0.7451 and 0.7364, respectively.
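The exact training script lives in the repository. The following is a hedged sketch of the standard Sentence Transformers STS fine-tuning recipe it is based on; the hyperparameters and output path are illustrative, not the repository's exact values:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from torch.utils.data import DataLoader

train_ds = load_dataset("stsb_multi_mt", name="es", split="train")
dev_ds = load_dataset("stsb_multi_mt", name="es", split="dev")

def to_examples(ds):
    # Gold scores are on a 0-5 scale; CosineSimilarityLoss expects labels in [0, 1].
    return [
        InputExample(texts=[row["sentence1"], row["sentence2"]],
                     label=row["similarity_score"] / 5.0)
        for row in ds
    ]

train_loader = DataLoader(to_examples(train_ds), shuffle=True, batch_size=16)

# Plain DistilBERT; Sentence Transformers adds mean pooling automatically.
model = SentenceTransformer("distilbert-base-uncased")
loss = losses.CosineSimilarityLoss(model)
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(
    to_examples(dev_ds), name="sts-dev-es"
)

model.fit(
    train_objectives=[(train_loader, loss)],
    evaluator=evaluator,
    epochs=1,                    # illustrative value
    warmup_steps=100,            # illustrative value
    output_path="./stsb-es-distilbert",  # illustrative output path
)
```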
Guide: Running Locally
- Clone the Repository: Download the repository, which contains the model files and the training and evaluation scripts.
- Install Dependencies: Install the required Python packages, typically sentence-transformers and datasets.
- Use the Script: Run the included training script to fine-tune the model on Spanish or any other language covered by stsb_multi_mt.
- Evaluate Performance: Compare the fine-tuned model against the base model using the provided evaluation script (a minimal sketch follows this list).
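The comparison can be reproduced with the Sentence Transformers similarity evaluator on the Spanish test split. This is a minimal sketch, not the repository's own evaluation script; the fine-tuned model path is the illustrative one from the training sketch above, and recent library versions return a metrics dict rather than a single score:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

test = load_dataset("stsb_multi_mt", name="es", split="test")
examples = [
    InputExample(texts=[row["sentence1"], row["sentence2"]],
                 label=row["similarity_score"] / 5.0)  # rescale 0-5 gold scores to [0, 1]
    for row in test
]
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(examples, name="sts-test-es")

# Compare the untuned base model with the fine-tuned checkpoint (path assumed).
for model_id in ["distilbert-base-uncased", "./stsb-es-distilbert"]:
    model = SentenceTransformer(model_id)
    print(model_id, evaluator(model))
```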
Suggestion: For efficient training and evaluation, consider using cloud GPU services such as AWS, Google Cloud, or Azure.
License
The model and associated scripts are provided for educational and research purposes. Check the repository for specific licensing information related to the dataset and model usage.