vit_huge_patch14_clip_336.laion2b_ft_in12k_in1k

timm

Introduction

The vit_huge_patch14_clip_336.laion2b_ft_in12k_in1k model is a Vision Transformer (ViT) for image classification. It was pretrained with OpenCLIP on LAION-2B and then fine-tuned on ImageNet-12k followed by ImageNet-1k. The model is available through the timm library and can be used both as an image classifier and as a feature backbone.

Architecture

  • Model Type: Image classification / feature backbone
  • Parameters: 632.5 million
  • GMACs: 363.7
  • Activations: 213.4 million
  • Image Size: 336 x 336
  • Datasets: ImageNet-1k, LAION-2B, ImageNet-12k

Key papers associated with this model include OpenCLIP, LAION-5B, and "An Image is Worth 16x16 Words", the original Vision Transformer paper.
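
The specifications above can be checked directly from the instantiated model. The snippet below is a minimal sketch, assuming a recent timm release (0.8+) that provides timm.data.resolve_model_data_config and that the pretrained weights can be downloaded:

    import timm

    model = timm.create_model(
        'vit_huge_patch14_clip_336.laion2b_ft_in12k_in1k',
        pretrained=True,
    )

    # Total parameter count (roughly 632.5 million)
    num_params = sum(p.numel() for p in model.parameters())
    print(f'Parameters: {num_params / 1e6:.1f}M')

    # Default preprocessing config, including the 336 x 336 input size
    data_config = timm.data.resolve_model_data_config(model)
    print(data_config['input_size'])  # (3, 336, 336)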

Training

The ViT model is pretrained on a large-scale dataset (LAION-2B) using a contrastive language-image learning approach, then fine-tuned on ImageNet-12k and ImageNet-1k. This multi-stage training enhances its ability to generalize across varied image classification tasks.

Guide: Running Locally

Basic Steps

  1. Install Dependencies: Ensure you have Python and PyTorch installed.
  2. Install timm Library:
    pip install timm
    
  3. Load the Model:
    import timm
    model = timm.create_model('vit_huge_patch14_clip_336.laion2b_ft_in12k_in1k', pretrained=True)
    model = model.eval()
    
  4. Prepare the Input: Use PIL to open an image and apply the model's preprocessing transforms.
  5. Run Inference: Pass the transformed image through the model to obtain class predictions or embeddings; a complete example follows this list.
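
Putting the steps together, the sketch below follows the usual timm inference pattern; the image path is a placeholder, and the forward_features / forward_head calls used for embeddings assume a recent timm release:

    import timm
    import torch
    from PIL import Image

    # Placeholder path: replace with an image of your own
    img = Image.open('example.jpg').convert('RGB')

    model = timm.create_model(
        'vit_huge_patch14_clip_336.laion2b_ft_in12k_in1k',
        pretrained=True,
    ).eval()

    # Build the preprocessing pipeline from the model's pretrained config (336 x 336 input)
    data_config = timm.data.resolve_model_data_config(model)
    transform = timm.data.create_transform(**data_config, is_training=False)

    with torch.no_grad():
        x = transform(img).unsqueeze(0)  # shape: (1, 3, 336, 336)

        # Classification: top-5 ImageNet-1k class probabilities
        logits = model(x)
        top5_prob, top5_idx = torch.topk(logits.softmax(dim=1), k=5)

        # Embeddings: unpooled token features and the pooled pre-logits vector
        features = model.forward_features(x)
        embedding = model.forward_head(features, pre_logits=True)

    print(top5_prob, top5_idx)
    print(embedding.shape)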

Cloud GPUs

For optimal performance, especially with large models like ViT, consider using cloud-based GPU services such as AWS, GCP, or Azure.
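
With a GPU available, the model can be moved to the device and run in half precision to reduce latency and memory use. This is a minimal sketch, assuming a CUDA-capable machine and that float16 inference is acceptable for your use case:

    import timm
    import torch

    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    model = timm.create_model(
        'vit_huge_patch14_clip_336.laion2b_ft_in12k_in1k',
        pretrained=True,
    ).eval().to(device)

    # Dummy batch standing in for preprocessed images: (batch, channels, 336, 336)
    x = torch.randn(4, 3, 336, 336, device=device)

    with torch.no_grad():
        if device == 'cuda':
            # Mixed-precision inference on GPU
            with torch.autocast('cuda', dtype=torch.float16):
                logits = model(x)
        else:
            logits = model(x)

    print(logits.shape)  # (4, 1000)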

License

This model is licensed under the Apache-2.0 License, which allows for both personal and commercial use with attribution.
