InternVL2-Llama3-76B

OpenGVLab

Introduction

InternVL 2.0 is a series of multimodal large language models, featuring instruction-tuned models ranging from 1 billion to 108 billion parameters. The InternVL2-Llama3-76B, a part of this series, is designed to surpass most open-source models and perform on par with proprietary commercial models. It excels in tasks like document comprehension, scientific problem-solving, and cultural understanding, using an 8k context window to handle long texts, images, and videos effectively.

Architecture

InternVL2-Llama3-76B combines the vision component, InternViT-6B-448px-V1-5, with the language component, Hermes-2-Theta-Llama-3-70B. Merging these two components yields a model optimized for a broad range of multimodal tasks.

Training

InternVL 2.0 is trained on extensive datasets of long texts, images, and videos, which strengthens its ability to handle diverse input types. With an 8k-token context window, it can process longer and more complex inputs than earlier versions. Limitations include potential biases and unexpected outputs due to the probabilistic nature of large language models.

Guide: Running Locally

To run InternVL2-Llama3-76B locally, follow these steps:

  1. Install Requirements: Ensure you have transformers>=4.37.2. Install using pip:

    pip install "transformers>=4.37.2"
    
  2. Model Loading: Use the example below for 16-bit (bfloat16) loading; an 8-bit quantization variant is sketched after this list:

    import torch
    from transformers import AutoModel
    
    path = "OpenGVLab/InternVL2-Llama3-76B"
    # trust_remote_code=True is required so that InternVL's custom model code is loaded
    model = AutoModel.from_pretrained(
        path, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True,
        use_flash_attn=True, trust_remote_code=True).eval().cuda()
    
  3. Multiple GPUs: Split the model across multiple GPUs if needed (a hypothetical sketch of the device-map logic appears after this list):

    import math
    import torch
    from transformers import AutoTokenizer, AutoModel
    
    def split_model(model_name):
        # Define the device map configuration as detailed in the full guide
        ...
    
    path = "OpenGVLab/InternVL2-Llama3-76B"
    device_map = split_model('InternVL2-Llama3-76B')
    # trust_remote_code=True is again required for the custom model code
    model = AutoModel.from_pretrained(
        path, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True,
        trust_remote_code=True, device_map=device_map).eval()
    
  4. Inference: Use the model for image or text inference as shown in the model card examples; a minimal text-only sketch appears below.
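
For step 2, a rough sketch of the 8-bit variant is shown here. It assumes bitsandbytes is installed (pip install bitsandbytes); load_in_8bit is a standard transformers option, though newer versions may prefer passing a BitsAndBytesConfig via quantization_config.

    import torch
    from transformers import AutoModel
    
    path = "OpenGVLab/InternVL2-Llama3-76B"
    # 8-bit weights via bitsandbytes; no .cuda() call is needed because
    # accelerate places the quantized weights on the GPU automatically.
    model = AutoModel.from_pretrained(
        path, torch_dtype=torch.bfloat16, load_in_8bit=True,
        low_cpu_mem_usage=True, use_flash_attn=True,
        trust_remote_code=True).eval()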

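For step 3, the full split_model implementation is given in the official guide. As a hypothetical illustration only, the device map it returns typically keeps the vision encoder on GPU 0 and spreads the language-model layers across all GPUs; the module names and layer count below are assumptions, not the official code.

    import math
    import torch
    
    def split_model_sketch(num_llm_layers=80, vision_share=0.5):
        # Hypothetical sketch: GPU 0 also hosts the vision encoder, so it gets
        # a reduced share of language-model layers; the rest are spread evenly.
        world_size = torch.cuda.device_count()
        per_gpu = math.ceil(num_llm_layers / (world_size - vision_share))
        shares = [per_gpu] * world_size
        shares[0] = math.ceil(per_gpu * vision_share)
        device_map, layer = {}, 0
        for gpu, n in enumerate(shares):
            for _ in range(n):
                if layer < num_llm_layers:
                    device_map[f'language_model.model.layers.{layer}'] = gpu
                    layer += 1
        # Assumed module names for the vision tower, projector, and embeddings.
        for name in ('vision_model', 'mlp1',
                     'language_model.model.embed_tokens',
                     'language_model.model.norm',
                     'language_model.lm_head'):
            device_map[name] = 0
        return device_map
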
Cloud GPUs: For better performance and resource management, consider using cloud GPU services like AWS, Google Cloud, or Azure.
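
Finally, for step 4, a minimal text-only conversation sketch follows the pattern given on the model card. The chat() helper comes from the repository's remote code (hence trust_remote_code=True); its exact signature may differ between model revisions, and image inputs additionally require the pixel-value preprocessing shown in the full examples.

    from transformers import AutoTokenizer
    
    # Assumes `model` was already loaded as in step 2 or step 3.
    path = "OpenGVLab/InternVL2-Llama3-76B"
    tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
    
    generation_config = dict(max_new_tokens=1024, do_sample=True)
    
    # Pure-text turn: pass None instead of pixel_values when there is no image.
    question = "Hello, who are you?"
    response, history = model.chat(tokenizer, None, question, generation_config,
                                   history=None, return_history=True)
    print(response)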

License

This project is released under the MIT License. The model uses pre-trained Hermes-2-Theta-Llama-3-70B, licensed under the Llama 3 Community License.
