miCSE
Introduction
The Mutual Information Contrastive Sentence Embedding (miCSE) model is designed for computing sentence similarity, especially in low-shot scenarios. It uses a mutual information-based contrastive learning framework to learn sentence embeddings in a sample-efficient way, making it suitable for tasks such as retrieval, similarity comparison, and clustering.
Architecture
miCSE employs a unique approach by aligning attention patterns across different views (embeddings from dropout augmentations) during contrastive learning. This alignment enforces syntactic consistency across the dropout-augmented views and regularizes the self-attention distribution, enhancing sample efficiency in representation learning. This makes miCSE particularly useful for real-world applications with limited training data.
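To make the idea concrete, the sketch below pairs a standard InfoNCE contrastive loss over two dropout-augmented views with a simple KL-based attention-alignment regularizer. This is an illustrative simplification (the function names, shapes, temperature, and weighting factor are assumptions), not the exact miCSE objective; refer to the GitHub repository for the actual implementation.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.05):
    # z1, z2: (batch, dim) sentence embeddings from two dropout-augmented
    # forward passes over the same sentences; matching indices are positives.
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / temperature              # (batch, batch) cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)

def attention_alignment_loss(attn_view1, attn_view2):
    # attn_view*: (batch, heads, seq, seq) self-attention probabilities,
    # e.g. obtained via model(**batch, output_attentions=True).
    # Penalizing the divergence between the two views keeps their
    # attention patterns aligned.
    log_p = attn_view1.clamp_min(1e-8).log()
    return F.kl_div(log_p, attn_view2, reduction="batchmean")

def combined_loss(z1, z2, attn1, attn2, lam=0.1):
    # lam is an illustrative weight balancing the contrastive term
    # against the attention-alignment regularizer.
    return info_nce_loss(z1, z2) + lam * attention_alignment_loss(attn1, attn2)
```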
Training
The model is trained on a random collection of English sentences from Wikipedia. Available training data ranges from full-shot to low-shot datasets, with splits varying in size from 10% to 0.0064% of the SimCSE training corpus. The source code and datasets used are accessible on GitHub.
Guide: Running Locally
To run miCSE locally:
- Environment Setup: Ensure you have Python and PyTorch installed. Set up a virtual environment if necessary.
- Install Dependencies: Use pip to install the transformers library from Hugging Face along with the other required packages.
```bash
pip install transformers torch datasets umap-learn
```
- Download Model: Use the transformers library to load the tokenizer and model.
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sap-ai-research/miCSE")
model = AutoModel.from_pretrained("sap-ai-research/miCSE")
```
- Run Inference: Encode sentences with the model and compare the resulting embeddings, for example with cosine similarity; a sketch follows below.
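A minimal inference sketch, assuming mean pooling over the token embeddings (the model card or GitHub repository may recommend a different pooling strategy, such as using the [CLS] token):
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sap-ai-research/miCSE")
model = AutoModel.from_pretrained("sap-ai-research/miCSE")
model.eval()

sentences = ["A cat sits on the mat.", "A kitten is resting on a rug."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch)

# Mean-pool token embeddings over non-padding positions.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between the two sentence embeddings.
similarity = F.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"Cosine similarity: {similarity.item():.4f}")
```
Scores close to 1 indicate semantically similar sentences; scores near 0 indicate unrelated ones.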
- Utilize Cloud: For large-scale processing, consider using cloud GPUs available on platforms like AWS, Google Cloud, or Azure to speed up computation.
License
miCSE is licensed under the Apache-2.0 License, allowing for both personal and commercial use, modification, and distribution with appropriate credit.