llava-llama-3-8b-v1_1-gguf

xtuner

Introduction

The llava-llama-3-8b-v1_1 model is a LLaVA-style vision-language model fine-tuned from Meta-Llama-3-8B-Instruct and CLIP-ViT-Large-patch14-336 on the ShareGPT4V-PT and InternVL-SFT datasets. It targets image-to-text tasks and is distributed here in GGUF format for use with llama.cpp and Ollama.

Architecture

The model pairs a CLIP-ViT-Large visual encoder at a 336×336 input resolution with a multi-layer perceptron (MLP) that projects visual features into the language model's embedding space. During pretraining, both the language model and the vision transformer are frozen; during fine-tuning, the language model is fully trained while the vision transformer is updated with LoRA.
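
The shape arithmetic behind this pathway can be sketched as follows. The dimensions used (ViT patch size 14, CLIP-ViT-L hidden size 1024, Llama-3-8B hidden size 4096) come from the referenced base models rather than this model card, so treat them as assumptions:

```python
# Shape arithmetic for the visual pathway, assuming standard
# CLIP-ViT-Large-patch14-336 and Llama-3-8B dimensions.
IMAGE_SIZE = 336   # input resolution (pixels per side)
PATCH_SIZE = 14    # ViT patch size (assumed from CLIP-ViT-L/14)
CLIP_DIM = 1024    # CLIP-ViT-L hidden size (assumed)
LLM_DIM = 4096     # Llama-3-8B hidden size (assumed)

patches_per_side = IMAGE_SIZE // PATCH_SIZE   # 336 / 14 = 24
num_image_tokens = patches_per_side ** 2      # 24 * 24 = 576 visual tokens per image

# The MLP projector maps each visual token into the LLM's embedding space,
# e.g. (576, 1024) -> Linear -> activation -> Linear -> (576, 4096).
print(f"{num_image_tokens} visual tokens of dim {CLIP_DIM} -> dim {LLM_DIM}")
```

Each image thus occupies several hundred positions in the language model's context, which is why the llava-cli commands below pass a context size of 4096.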

Training

Training of llava-llama-3-8b-v1_1 proceeds in two stages: pretraining on ShareGPT4V-PT and supervised fine-tuning on InternVL-SFT. The fine-tuning stage fully trains the language model while applying LoRA to the vision transformer.

Guide: Running Locally

  1. Download Models:

    • wget https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-gguf/resolve/main/llava-llama-3-8b-v1_1-mmproj-f16.gguf
    • wget https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-gguf/resolve/main/llava-llama-3-8b-v1_1-f16.gguf
    • wget https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-gguf/resolve/main/llava-llama-3-8b-v1_1-int4.gguf
  2. Using Ollama:

    • For FP16:
      ollama create llava-llama3-f16 -f ./OLLAMA_MODELFILE_F16
      ollama run llava-llama3-f16 "xx.png Describe this image"
      
    • For INT4:
      ollama create llava-llama3-int4 -f ./OLLAMA_MODELFILE_INT4
      ollama run llava-llama3-int4 "xx.png Describe this image"
      
  3. Using Llama.cpp:

    • Build llama.cpp and llava-cli.
    • For FP16:
      ./llava-cli -m ./llava-llama-3-8b-v1_1-f16.gguf --mmproj ./llava-llama-3-8b-v1_1-mmproj-f16.gguf --image YOUR_IMAGE.jpg -c 4096 -e -p "<|start_header_id|>user<|end_header_id|>\n\n<image>\nDescribe this image<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
      
    • For INT4:
      ./llava-cli -m ./llava-llama-3-8b-v1_1-int4.gguf --mmproj ./llava-llama-3-8b-v1_1-mmproj-f16.gguf --image YOUR_IMAGE.jpg -c 4096 -e -p "<|start_header_id|>user<|end_header_id|>\n\n<image>\nDescribe this image<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
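
The `ollama create` commands in step 2 reference Modelfile paths (`./OLLAMA_MODELFILE_F16`, `./OLLAMA_MODELFILE_INT4`) without showing their contents. A minimal sketch for the FP16 variant, assuming the downloaded GGUF files sit in the current directory — the template and stop token mirror the Llama-3 chat format used in the llava-cli commands above, and are assumptions rather than the repository's exact Modelfile:

```
FROM ./llava-llama-3-8b-v1_1-f16.gguf
FROM ./llava-llama-3-8b-v1_1-mmproj-f16.gguf
TEMPLATE """<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""
PARAMETER stop "<|eot_id|>"
```

An INT4 Modelfile would be identical except that the first FROM line points at llava-llama-3-8b-v1_1-int4.gguf; the mmproj projector file is only published in FP16.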
      

Suggested Cloud GPUs: If local hardware is insufficient, consider cloud GPU instances from providers such as AWS, GCP, or Azure.

License

Refer to the XTuner GitHub repository for licensing details.