llava-llama-3-8b-v1_1-transformers

xtuner

Introduction
The llava-llama-3-8b-v1_1-transformers model is a LLaVA model fine-tuned by XTuner from Meta-Llama-3-8B-Instruct and CLIP-ViT-Large-patch14-336, using datasets such as ShareGPT4V-PT and InternVL-SFT. It is published in the Hugging Face LLaVA format and targets image-to-text and conversational use.

Architecture
The model pairs a CLIP-L visual encoder with an MLP projector and processes images at a resolution of 336. Pretraining keeps both the language model (LLM) and the visual transformer (ViT) frozen; fine-tuning then updates the full LLM and applies low-rank adaptation (LoRA) to the ViT. It is trained on datasets like LLaVA-PT and LLaVA-Mix, with further fine-tuning on ShareGPT4V-PT and InternVL-SFT.
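
To make the resolution figure concrete, the sketch below estimates how many visual tokens one image contributes. It takes the patch size of 14 from the encoder name (CLIP-ViT-Large-patch14-336) and assumes one token per patch, ignoring any class token; the function name `num_image_tokens` is hypothetical and not part of any library.

```python
def num_image_tokens(resolution: int = 336, patch_size: int = 14) -> int:
    """Count the square patches a ViT cuts from a resolution x resolution image.

    Assumption: one visual token per patch, with no class token counted.
    """
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    patches_per_side = resolution // patch_size
    return patches_per_side * patches_per_side


# 336 / 14 = 24 patches per side, so 24 * 24 = 576 patches in total.
print(num_image_tokens())
```

Under these assumptions, each image occupies a few hundred positions in the LLM's context, which is worth keeping in mind when choosing prompt length and `max_new_tokens`.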

Training
Training proceeds in two stages: a pretraining stage with the LLM and ViT frozen, followed by a fine-tuning stage with full LLM fine-tuning and LoRA applied to the ViT. The model has been trained on 558K instances of LLaVA-PT and 665K instances of LLaVA-Mix, with additional fine-tuning on 1246K instances of ShareGPT4V-PT and 1268K instances of InternVL-SFT.
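
The freezing strategy can be summarized as a small lookup table. This is only an illustrative sketch: the `STAGES` dictionary and `trainable_parts` helper are hypothetical names, not XTuner's configuration format, and the assumption that the projector is trained in both stages comes from the usual LLaVA recipe rather than from the text above.

```python
# Hypothetical summary of which components are updated in each stage.
# "frozen" = no updates, "full" = all weights updated, "lora" = low-rank adapters only.
STAGES = {
    "pretrain": {"llm": "frozen", "vit": "frozen", "projector": "trained"},  # projector: assumption
    "finetune": {"llm": "full", "vit": "lora", "projector": "trained"},      # projector: assumption
}


def trainable_parts(stage: str) -> list[str]:
    """Return the components that receive any gradient updates in a stage."""
    return sorted(name for name, mode in STAGES[stage].items() if mode != "frozen")


for stage in STAGES:
    print(stage, "->", trainable_parts(stage))
```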

Guide: Running Locally
To run the model locally, follow these steps:

  1. Install the Transformers library and the packages the example imports:

    pip install transformers torch pillow requests
    
  2. Import necessary libraries:

    from transformers import pipeline
    from PIL import Image
    import requests
    
  3. Set up the model pipeline:

    model_id = "xtuner/llava-llama-3-8b-v1_1-transformers"
    # device=0 selects the first CUDA GPU.
    pipe = pipeline("image-to-text", model=model_id, device=0)
    
  4. Load an image:

    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    
  5. Generate text from the image:

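    # Llama 3 chat format; <image> marks where the image is inserted.
    prompt = ("<|start_header_id|>user<|end_header_id|>\n\n<image>\nWhat are these?<|eot_id|>"
              "<|start_header_id|>assistant<|end_header_id|>\n\n")
    outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
    # The pipeline returns a list of dicts; the text is under "generated_text".
    print(outputs)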
    

The model has roughly 8 billion parameters, so inference needs a GPU with ample memory. If local hardware falls short, consider cloud GPUs from providers like AWS, Google Cloud, or Azure.
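
The prompt in step 5 follows the Llama 3 chat layout with an `<image>` placeholder ahead of the question. To ask other questions without retyping the special tokens, a small helper can wrap the template. This is a minimal sketch: `build_prompt` is a hypothetical name, and it reproduces only the single-turn template shown in the guide.

```python
def build_prompt(question: str) -> str:
    """Wrap a user question in the single-turn template used in step 5."""
    return (
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"<image>\n{question}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


# Reproduces the prompt from the guide exactly.
prompt = build_prompt("What are these?")
print(repr(prompt))
```

The result can be passed as the `prompt` argument in step 5 in place of the hand-written string.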

License
The model and associated resources are provided under the terms specified by the XTuner project. For more details, refer to the XTuner GitHub repository.
