Llama-3.2V-11B-cot

Xkev

Introduction

Llama-3.2V-11B-cot is a vision-language model for image-text-to-text tasks that is designed to reason systematically, step by step. It is the first release of LLaVA-CoT and is fine-tuned from the meta-llama/Llama-3.2-11B-Vision-Instruct model.

Architecture

  • Base Model: meta-llama/Llama-3.2-11B-Vision-Instruct
  • Pipeline Tag: image-text-to-text
  • Library: Transformers

Training

The model was trained on the LLaVA-CoT-100k dataset and fine-tuned with llama-recipes using the following hyperparameters; a configuration sketch follows the list:

  • Learning Rate: 1e-5
  • Number of Epochs: 3
  • Batch Size: 4
  • Context Length: 4096
  • Mixed Precision: Enabled
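
The fine-tuning itself was run through llama-recipes, whose exact invocation is not reproduced in the card. As a hedged illustration only, the sketch below maps the listed hyperparameters onto a generic Hugging Face TrainingArguments configuration; the output directory, the choice of bf16 for mixed precision, and the logging/saving settings are assumptions, not details from the model card.

```python
# Illustrative only: the published hyperparameters expressed as a generic
# Hugging Face TrainingArguments object. The original run used llama-recipes;
# output_dir, bf16 (for "mixed precision"), logging, and save strategy are assumptions.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="llava-cot-finetune",   # assumed output path
    learning_rate=1e-5,                # Learning Rate: 1e-5
    num_train_epochs=3,                # Number of Epochs: 3
    per_device_train_batch_size=4,     # Batch Size: 4
    bf16=True,                         # Mixed Precision: Enabled (assuming bf16)
    logging_steps=10,
    save_strategy="epoch",
)

# The context length of 4096 is applied at preprocessing time rather than via
# TrainingArguments, e.g. processor(..., max_length=4096, truncation=True).
```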

Guide: Running Locally

To run the model locally, follow these steps:

  1. Set up the environment: Install the necessary libraries, primarily Hugging Face Transformers.
  2. Download the model: Pull the weights from the Hugging Face Model Hub.
  3. Inference: Run predictions with the same inference code used for Llama-3.2-11B-Vision-Instruct; a minimal sketch is shown below.

For better performance, cloud GPUs from providers such as AWS, Google Cloud, or Azure are recommended.
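
As a minimal sketch (not the authors' official example), the snippet below loads the model with Transformers and runs a single image-plus-text prompt. The Hub repo id Xkev/Llama-3.2V-11B-cot, the local image path example.jpg, the prompt text, and the bf16/device_map settings are assumptions for illustration.

```python
# Minimal inference sketch using Hugging Face Transformers.
# Assumptions: repo id "Xkev/Llama-3.2V-11B-cot", a local image "example.jpg",
# and a bf16 + device_map="auto" setup (a GPU with ample memory is expected).
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "Xkev/Llama-3.2V-11B-cot"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder image path
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe the image and reason step by step."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the echoed prompt.
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```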

License

The model is licensed under the Apache-2.0 License.
