Llama-3.2V-11B-cot
Introduction
Llama-3.2V-11B-cot is a vision-language model for image-text-to-text tasks that performs systematic, step-by-step reasoning. It is the first release of LLaVA-CoT and is fine-tuned from meta-llama/Llama-3.2-11B-Vision-Instruct; the weights are published on the Hugging Face Hub under the Xkev organization.
Architecture
- Base Model: meta-llama/Llama-3.2-11B-Vision-Instruct
- Pipeline Tag: image-text-to-text
- Library: Transformers
Training
The model was fine-tuned on the LLaVA-CoT-100k dataset using llama-recipes with the following hyperparameters (see the sketch after this list):
- Learning Rate: 1e-5
- Number of Epochs: 3
- Batch Size: 4
- Context Length: 4096
- Mixed Precision: Enabled
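These values can be passed to the llama-recipes fine-tuning entry point. The sketch below is illustrative only: the import path, the config-key names (lr, num_epochs, batch_size_training, context_length, mixed_precision), and the dataset/output settings are assumptions about the llama-recipes API, and the code that registers LLaVA-CoT-100k as a custom dataset is not shown.

```python
# Hypothetical fine-tuning sketch; key names and the dataset hook are assumed,
# not the exact recipe used to train Llama-3.2V-11B-cot.
from llama_recipes.finetuning import main

main(
    model_name="meta-llama/Llama-3.2-11B-Vision-Instruct",  # base model
    lr=1e-5,                    # learning rate
    num_epochs=3,               # number of epochs
    batch_size_training=4,      # batch size
    context_length=4096,        # context length
    mixed_precision=True,       # mixed precision enabled
    dataset="custom_dataset",   # assumed hook for LLaVA-CoT-100k
    output_dir="./llama-3.2v-11b-cot",  # illustrative output path
)
```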
Guide: Running Locally
To run the model locally, follow these steps:
- Set up the environment: install the required libraries, primarily Hugging Face Transformers (a version recent enough to support Llama 3.2 Vision).
- Download the model: pull the weights from the Hugging Face Model Hub.
- Inference: run predictions with the same inference code used for Llama-3.2-11B-Vision-Instruct, as in the sketch below.
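The snippet below sketches the inference step, adapted from the standard Transformers usage for Llama-3.2-11B-Vision-Instruct. The repository id Xkev/Llama-3.2V-11B-cot, the image URL, and the prompt are placeholders; substitute your own image and question.

```python
# pip install torch transformers pillow requests
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "Xkev/Llama-3.2V-11B-cot"  # assumed Hugging Face Hub repository id

# Load the fine-tuned weights and the matching processor.
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Any RGB image works; this URL is a placeholder.
url = "https://example.com/sample.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build a chat-style prompt that pairs one image with one question.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What is happening in this image? Reason step by step."},
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)

# Generate the structured reasoning followed by the final answer.
output = model.generate(**inputs, max_new_tokens=1024)
print(processor.decode(output[0], skip_special_tokens=True))
```

In bfloat16, the 11B parameters alone occupy roughly 22 GB of GPU memory, which is why the cloud-GPU recommendation below applies to most consumer setups.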
For performance optimization, using cloud GPUs from providers like AWS, Google Cloud, or Azure is recommended.
License
The model is licensed under the Apache-2.0 License.