Bingsu/clip-vit-large-patch14-ko
Introduction
clip-vit-large-patch14-ko is a Korean CLIP model trained with the method from "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation". It is designed for zero-shot image classification tasks.
Architecture
This model is based on the CLIP (Contrastive Language–Image Pre-training) architecture, pairing a ViT-Large (Vision Transformer) image encoder that splits images into 14×14-pixel patches (hence "patch14") with a text encoder. It supports zero-shot image classification and is compatible with popular machine learning libraries including Transformers, PyTorch, TensorFlow, and Safetensors.
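For a quick check of these architectural details without downloading the full weights, the published configuration can be inspected directly. This is a minimal sketch, assuming the checkpoint exposes a CLIP-style configuration with vision_config and projection_dim fields; exact attribute names may differ for other config classes.

from transformers import AutoConfig

repo = "Bingsu/clip-vit-large-patch14-ko"
config = AutoConfig.from_pretrained(repo)

# Vision encoder settings: patch size, hidden width, and number of transformer layers.
vision = config.vision_config
print(vision.patch_size, vision.hidden_size, vision.num_hidden_layers)

# Dimensionality of the shared image-text embedding space.
print(config.projection_dim)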
Training
The model was trained on all Korean-English parallel data available on AIHUB. The training code is available on GitHub: KoCLIP training code.
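For background, the knowledge-distillation recipe referenced above trains a student text encoder to reproduce a fixed teacher's English sentence embeddings for both sides of each Korean-English pair, so a Korean sentence lands at the same point in embedding space as its English translation. The sketch below only illustrates that objective and is not the actual KoCLIP training code; teacher, student, and their encode() method are hypothetical placeholders.

import torch
import torch.nn.functional as F

def distillation_loss(teacher, student, english_batch, korean_batch):
    # Teacher embeddings of the English sentences serve as fixed targets.
    with torch.no_grad():
        target = teacher.encode(english_batch)  # hypothetical encode() API
    # The student should match the teacher on the English sentences...
    loss_en = F.mse_loss(student.encode(english_batch), target)
    # ...and map the Korean translations onto the same vectors.
    loss_ko = F.mse_loss(student.encode(korean_batch), target)
    return loss_en + loss_ko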
Guide: Running Locally
To run the model locally, follow these steps:
- Install Dependencies:
  - Ensure you have Python and PyTorch installed.
  - Install the Transformers library from Hugging Face (for example, pip install transformers).
- Code Example:
import requests
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

repo = "Bingsu/clip-vit-large-patch14-ko"
model = AutoModel.from_pretrained(repo)
processor = AutoProcessor.from_pretrained(repo)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Candidate captions: "two cats" and "two dogs"
inputs = processor(text=["고양이 두 마리", "개 두 마리"], images=image, return_tensors="pt", padding=True)

with torch.inference_mode():
    outputs = model(**inputs)

# Image-text similarity scores, converted to probabilities over the candidate captions.
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)
print(probs)
- Alternative Using Pipeline:
from transformers import pipeline

repo = "Bingsu/clip-vit-large-patch14-ko"
pipe = pipeline("zero-shot-image-classification", model=repo)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"

# Candidate labels: "one cat", "two cats", "cat friends lounging on a pink sofa"
result = pipe(images=url, candidate_labels=["고양이 한 마리", "고양이 두 마리", "분홍색 소파에 드러누운 고양이 친구들"])
print(result)
- Cloud GPUs: For enhanced performance, consider using cloud-based GPU services such as AWS, Google Cloud Platform, or Azure to handle intensive computations; a minimal GPU-inference sketch follows this list.
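The sketch below shows one way to run the earlier example on a GPU; it assumes a CUDA device is available and that the checkpoint loads through AutoModel as in the code example above.

import requests
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

repo = "Bingsu/clip-vit-large-patch14-ko"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModel.from_pretrained(repo).to(device)
processor = AutoProcessor.from_pretrained(repo)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Move the processor outputs onto the same device as the model.
inputs = processor(text=["고양이 두 마리", "개 두 마리"], images=image, return_tensors="pt", padding=True).to(device)

with torch.inference_mode():
    probs = model(**inputs).logits_per_image.softmax(dim=1)
print(probs)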
License
This model is licensed under the MIT License, permitting reuse with few restrictions.