Introduction

FashionCLIP is a CLIP-based model that produces product representations for fashion concepts and can be applied zero-shot to downstream tasks. It starts from OpenAI's pre-trained ViT-B/32 CLIP checkpoint and fine-tunes it on a specialized fashion dataset to improve performance on fashion-related tasks and datasets.

Architecture

FashionCLIP employs a ViT-B/32 Transformer for image encoding and a masked self-attention Transformer for text encoding. The model is trained with a contrastive loss that pushes matching (image, text) pairs toward higher similarity than non-matching pairs within a batch.
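
The following is a minimal sketch of such a symmetric contrastive (CLIP-style) objective in PyTorch; the temperature value and function name are illustrative and not taken from the FashionCLIP codebase.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds: torch.Tensor,
                          text_embeds: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # Normalize so dot products equal cosine similarities.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Similarity matrix: entry (i, j) compares image i with text j.
    logits = image_embeds @ text_embeds.t() / temperature

    # The matching caption for image i sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```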

Training

FashionCLIP was trained on over 800,000 (image, text) pairs from the Farfetch dataset, which covers more than 3,000 brands across a wide range of product types. The dataset consists of standard product images photographed on a white background, paired with text that combines the product highlights with a short description. Fine-tuning CLIP on this large, domain-specific dataset improves zero-shot performance on fashion benchmarks relative to the original checkpoint.

Guide: Running Locally

  1. Set Up the Environment: Clone the FashionCLIP repository from GitHub and install the required dependencies, including transformers and torch.
  2. Download Model: Access the model weights via Hugging Face's model hub.
  3. Run Inference: Use the Python scripts provided in the repository to perform zero-shot image classification (a minimal example is sketched after this list).
  4. Cloud GPU Recommendation: For better performance, consider using cloud GPU services such as AWS, Google Cloud, or Azure.
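
Below is a minimal zero-shot classification sketch that uses the transformers CLIP classes rather than the repository's own scripts; the model identifier patrickjohncyh/fashion-clip, the image path, and the candidate labels are assumptions to adapt to your setup.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hugging Face model id assumed here; verify it against the model hub page.
model_id = "patrickjohncyh/fashion-clip"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

# Any fashion product photo; "product.jpg" is a placeholder path.
image = Image.open("product.jpg")
candidate_labels = ["a red dress", "a leather handbag", "white sneakers"]

inputs = processor(text=candidate_labels, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image's similarity to each candidate caption.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(candidate_labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```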

License

FashionCLIP is released under the MIT License, which permits broad use, modification, and distribution provided the copyright and license notice are retained.
