Introduction

The Mutual Information Contrastive Sentence Embedding (miCSE) model is designed for computing sentence similarity, especially in low-shot scenarios. It uses a mutual information-based contrastive learning framework to learn sentence embeddings sample-efficiently, making it suitable for tasks such as retrieval, similarity comparison, and clustering.

Architecture

miCSE aligns attention patterns across different views (embeddings of the same sentence under different dropout augmentations) during contrastive learning. Enforcing this syntactic consistency across the augmented views regularizes the self-attention distribution and improves sample efficiency in representation learning, which makes miCSE particularly useful for real-world applications with limited training data.

Training

The model is trained on a random collection of English sentences from Wikipedia. Available training data ranges from full-shot to low-shot datasets, with splits varying in size from 10% to 0.0064% of the SimCSE training corpus. The source code and datasets used are accessible on GitHub.

Guide: Running Locally

To run miCSE locally:

  1. Environment Setup: Ensure you have Python and PyTorch installed. Set up a virtual environment if necessary.
  2. Install Dependencies: Use pip to install the required libraries, including the transformers library from Hugging Face.
    pip install transformers torch datasets umap-learn
    
  3. Download Model: Use the transformers library to load the tokenizer and model.
    from transformers import AutoTokenizer, AutoModel
    tokenizer = AutoTokenizer.from_pretrained("sap-ai-research/miCSE")
    model = AutoModel.from_pretrained("sap-ai-research/miCSE")
    
  4. Run Inference: Tokenize sentences, pass them through the model, and pool the token-level outputs into fixed-size sentence embeddings.
  5. Utilize Cloud: For large-scale processing, consider using cloud GPUs available on platforms like AWS, Google Cloud, or Azure to speed up computation.
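The inference step above can be sketched as follows. This is a minimal example, not the authors' reference pipeline: it assumes mean pooling over the last hidden state (masking out padding tokens) as the sentence representation, and the example sentences are illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("sap-ai-research/miCSE")
model = AutoModel.from_pretrained("sap-ai-research/miCSE")
model.eval()

sentences = [
    "A man is playing a guitar.",
    "Someone is strumming an instrument.",
    "The stock market fell sharply today.",
]

batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**batch)

# Mean-pool token embeddings, ignoring padding positions (assumed pooling;
# check the model card for the recommended strategy, e.g. [CLS] pooling).
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between the first sentence and the other two.
sims = torch.nn.functional.cosine_similarity(embeddings[0:1], embeddings[1:])
print(sims)
```

The resulting vectors can be used directly for retrieval or clustering; cosine similarity is the standard comparison metric for contrastively trained sentence embeddings.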

License

miCSE is licensed under the Apache-2.0 License, allowing for both personal and commercial use, modification, and distribution with appropriate credit.
