clip-vit-large-patch14-ko

Bingsu

Introduction

The clip-vit-large-patch14-ko model is a Korean CLIP model trained with the method described in Making Monolingual Sentence Embeddings Multilingual via Knowledge Distillation (Reimers & Gurevych, 2020). It is designed for zero-shot image classification.
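
The core idea of that method: a frozen English CLIP text encoder serves as the teacher, and a multilingual student encoder is trained so that its embeddings of Korean translations (and of the English originals) match the teacher's embeddings of the English sentences. The sketch below illustrates this objective; `teacher` and `student` are hypothetical encoder callables, not this model's actual training code:

    import torch
    import torch.nn.functional as F
    
    def distillation_loss(teacher, student, en_batch, ko_batch):
        # `teacher` and `student` map a list of sentences to a (batch, dim)
        # embedding tensor; en_batch[i] and ko_batch[i] are parallel sentences.
        with torch.no_grad():
            target = teacher(en_batch)                    # frozen teacher embeddings
        loss_ko = F.mse_loss(student(ko_batch), target)   # pull Korean into the same space
        loss_en = F.mse_loss(student(en_batch), target)   # keep English aligned too
        return loss_ko + loss_en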

Architecture

This model is based on the CLIP (Contrastive Language–Image Pre-training) architecture, using a ViT-Large vision encoder with a 14×14 pixel patch size (the "patch14" in the name). It supports zero-shot image classification and is compatible with popular machine learning libraries including Transformers, PyTorch, TensorFlow, and Safetensors.
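
As a quick check of these settings, the configuration can be inspected after download; the values in the comments are the standard ViT-L/14 figures, stated here as expectations rather than verified output:

    from transformers import AutoConfig
    
    config = AutoConfig.from_pretrained("Bingsu/clip-vit-large-patch14-ko")
    print(config.vision_config.patch_size)   # expected: 14 (the "patch14" in the name)
    print(config.vision_config.hidden_size)  # expected: 1024 for a ViT-Large encoder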

Training

The model was trained on all of the Korean-English parallel corpora available on AIHub. The training code can be found on GitHub: KoCLIP training code.

Guide: Running Locally

To run the model locally, follow these steps:

  1. Install Dependencies:

    • Ensure you have Python and PyTorch installed.
    • Install the Transformers library from Hugging Face, along with Pillow and requests for the examples below:
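
    pip install torch transformers Pillow requests
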
  2. Code Example:

    import requests
    import torch
    from PIL import Image
    from transformers import AutoModel, AutoProcessor
    
    # load the model and its paired tokenizer/image processor from the Hub
    repo = "Bingsu/clip-vit-large-patch14-ko"
    model = AutoModel.from_pretrained(repo)
    processor = AutoProcessor.from_pretrained(repo)
    
    # a COCO validation image showing two cats lying on a sofa
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    # Korean candidate captions: "two cats" and "two dogs"
    inputs = processor(text=["고양이 두 마리", "개 두 마리"], images=image, return_tensors="pt", padding=True)
    with torch.inference_mode():
        outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image
    probs = logits_per_image.softmax(dim=1)  # per-image similarity scores as probabilities
    
    print(probs)
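
    Since the sample image shows two cats on a sofa, nearly all of the printed probability mass should land on the first label ("two cats").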
    
  3. Alternative Using Pipeline:

    from transformers import pipeline
    
    # the pipeline bundles preprocessing, inference, and label ranking in one call
    repo = "Bingsu/clip-vit-large-patch14-ko"
    pipe = pipeline("zero-shot-image-classification", model=repo)
    
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    # Korean candidate labels: "one cat", "two cats", "cat friends lounging on a pink sofa"
    result = pipe(images=url, candidate_labels=["고양이 한 마리", "고양이 두 마리", "분홍색 소파에 드러누운 고양이 친구들"])
    
    print(result)
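
    The pipeline returns one dictionary per candidate label, each containing a label and a score, sorted from most to least likely.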
    
  4. Cloud GPUs: For faster inference, consider running the model on a cloud GPU service such as AWS, Google Cloud Platform, or Azure; the snippet below shows how to move the step 2 example onto a GPU.
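
    As a minimal sketch, this reuses the `model` and `inputs` objects defined in step 2:

    import torch
    
    # use a GPU when one is available, otherwise fall back to the CPU
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    with torch.inference_mode():
        outputs = model(**inputs)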

License

This model is licensed under the MIT License, permitting reuse with few restrictions.
