Stanford-Car-ViT-Patch16

therealcyberlord

Introduction

The Stanford-Car-ViT-Patch16 model is a ViT base model fine-tuned on the Stanford Cars dataset. It achieves approximately 86% accuracy on the test set, serving as a solid baseline for further tuning.

Architecture

The model is based on the Vision Transformer (ViT) architecture, specifically the vit-base-patch16-224 checkpoint: each 224x224 input image is split into 16x16 pixel patches, and the resulting patch sequence is processed by a standard Transformer encoder. This architecture is well suited to image classification tasks.
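The patch geometry above implies a fixed token count, which can be worked out directly (the sizes below are the standard ViT-Base configuration, not values stated in this card):

```python
# Back-of-the-envelope token count for vit-base-patch16-224.
image_size = 224   # input resolution (pixels per side)
patch_size = 16    # each patch is 16x16 pixels
hidden_size = 768  # ViT-Base embedding dimension (per token)

patches_per_side = image_size // patch_size  # 14 patches along each side
num_patches = patches_per_side ** 2          # 196 patch tokens per image
seq_len = num_patches + 1                    # plus one [CLS] token

print(patches_per_side, num_patches, seq_len)  # 14 196 197
```

The [CLS] token's final hidden state is what the classification head reads to predict one of the 196 car classes.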

Training

The model is trained on the Stanford Cars dataset, which contains 16,185 images across 196 car classes, each labeled at the Make, Model, and Year level. The dataset is divided into 8,144 training images, 2,000 validation images, and 6,041 testing images. Note that newer car models are not included in this dataset.
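As a quick sanity check, the three split sizes stated above account for every image in the dataset:

```python
# Split sizes as given in the card.
train, val, test = 8_144, 2_000, 6_041

total = train + val + test
print(total)  # 16185, the full Stanford Cars dataset
```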

Guide: Running Locally

To use the model locally, follow these steps:

  1. Install the transformers library and a backend such as PyTorch, if not already installed:

    pip install transformers torch
    
  2. Use the following Python code to load the model and feature extractor:

    from transformers import AutoFeatureExtractor, AutoModelForImageClassification
    
    extractor = AutoFeatureExtractor.from_pretrained("therealcyberlord/stanford-car-vit-patch16")
    model = AutoModelForImageClassification.from_pretrained("therealcyberlord/stanford-car-vit-patch16")
    
  3. For more efficient processing, consider using cloud GPUs from platforms like AWS, Google Cloud, or Azure.
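Putting the steps above together, a minimal inference sketch might look like the following. The `car.jpg` path and the `top_prediction` helper are illustrative; the class names are read from the checkpoint's `id2label` config, and the pipeline only runs when a sample image is present:

```python
from pathlib import Path


def top_prediction(scores, id2label):
    """Return the class name with the highest score (illustrative helper)."""
    idx = max(range(len(scores)), key=scores.__getitem__)
    return id2label[idx]


# Run the full pipeline only when a sample image is available.
if Path("car.jpg").exists():
    import torch
    from PIL import Image
    from transformers import AutoFeatureExtractor, AutoModelForImageClassification

    repo = "therealcyberlord/stanford-car-vit-patch16"
    extractor = AutoFeatureExtractor.from_pretrained(repo)
    model = AutoModelForImageClassification.from_pretrained(repo)

    inputs = extractor(images=Image.open("car.jpg"), return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    # Map the top logit back to a human-readable class name.
    print(top_prediction(logits[0].tolist(), model.config.id2label))
```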

License

The model is available under the Apache 2.0 license, which allows for both personal and commercial use with appropriate credit to the original authors.
