XLM-Roberta-Large-Vit-B-16Plus
Introduction
The repository provides ONNX exports of the multilingual CLIP model M-CLIP/XLM-Roberta-Large-Vit-B-16Plus. The model uses separate visual and textual encoders to generate image and text embeddings and is intended for use with Immich, a self-hosted photo library application.
Architecture
The model pairs the XLM-Roberta-Large transformer as its text encoder with a ViT-B-16Plus vision transformer as its image encoder. Each encoder produces embeddings independently, and the two embedding spaces are aligned so that multilingual text can be matched against images.
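As a rough sketch of that split (the file paths below are assumptions about how the ONNX exports are laid out; check the repository listing for the actual names), each encoder can be loaded and inspected as an independent ONNX graph:

```python
import onnxruntime as ort

# The two encoders ship as separate ONNX graphs and run independently.
visual = ort.InferenceSession("visual/model.onnx")    # image -> embedding
textual = ort.InferenceSession("textual/model.onnx")  # text  -> embedding

# Print the input/output signatures each graph expects.
for name, sess in (("visual", visual), ("textual", textual)):
    inputs = [(i.name, i.shape) for i in sess.get_inputs()]
    outputs = [(o.name, o.shape) for o in sess.get_outputs()]
    print(name, inputs, "->", outputs)
```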
Training
The visual and textual encoders were trained so that their embeddings land in a shared space: images and multilingual text describing the same content map to nearby vectors. These embeddings can then be used in downstream tasks, such as image retrieval or captioning, across languages.
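In practice, downstream use amounts to comparing vectors from the two embedding spaces. A minimal NumPy sketch of ranking a gallery of image embeddings against a text query (the variable names are illustrative):

```python
import numpy as np

def cosine_similarity(queries: np.ndarray, gallery: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between (m, d) queries and (n, d) gallery."""
    queries = queries / np.linalg.norm(queries, axis=-1, keepdims=True)
    gallery = gallery / np.linalg.norm(gallery, axis=-1, keepdims=True)
    return queries @ gallery.T  # (m, n) similarity matrix

# text_emb: (1, d) query embedding; image_embs: (n, d) gallery embeddings.
# best = int(np.argmax(cosine_similarity(text_emb, image_embs)))
```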
Guide: Running Locally
- Clone the Repository: Clone the model repository from Hugging Face (or fetch it with `huggingface_hub`).
- Install Dependencies: Ensure ONNX Runtime and any other required libraries are installed in your environment.
- Download Model Files: Download the ONNX files for both the visual and textual encoders.
- Set Up Environment: Configure Immich (or your own pipeline) to use the downloaded model.
- Inference: Run inference to generate embeddings for your dataset; see the sketch after this list.
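Put together, the steps above might look like the sketch below. The ONNX input names, the 240x240 image resolution, and the use of the upstream M-CLIP tokenizer are assumptions to verify against the repository's configuration files, not guarantees:

```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

# Steps 1-3: fetch the ONNX exports from Hugging Face.
repo = snapshot_download("immich-app/XLM-Roberta-Large-Vit-B-16Plus")

# Assumed file layout; check the repository for the real paths.
textual = ort.InferenceSession(f"{repo}/textual/model.onnx")
visual = ort.InferenceSession(f"{repo}/visual/model.onnx")

# Text side: the underlying M-CLIP model uses an XLM-Roberta tokenizer.
tokenizer = AutoTokenizer.from_pretrained("M-CLIP/XLM-Roberta-Large-Vit-B-16Plus")
tokens = tokenizer(["a photo of a dog"], padding=True, return_tensors="np")

# Step 5: run inference. The input names here are assumptions; in real
# code, read them from session.get_inputs() instead of hard-coding them.
text_emb = textual.run(None, {
    "input_ids": tokens["input_ids"].astype(np.int64),
    "attention_mask": tokens["attention_mask"].astype(np.int64),
})[0]

# Placeholder for a preprocessed image batch: (1, 3, 240, 240) float32.
image = np.zeros((1, 3, 240, 240), dtype=np.float32)
image_emb = visual.run(None, {"image": image})[0]

print(text_emb.shape, image_emb.shape)
```

The resulting embeddings can then be compared with the cosine-similarity helper shown earlier.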
Cloud GPUs: Use a cloud GPU service such as AWS, Google Cloud, or Azure for faster inference, especially when embedding large datasets.
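With `onnxruntime-gpu` installed on such an instance, a session can be directed to the GPU through execution providers, falling back to the CPU when CUDA is unavailable:

```python
import onnxruntime as ort

session = ort.InferenceSession(
    "visual/model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # shows which providers were actually enabled
```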
License
The project follows the licensing terms provided in the Hugging Face repository. Users should ensure compliance with these terms when using the model in their applications.