Typhoon2-qwen2vl-7b-vision-instruct

scb10x

Introduction

Typhoon2-qwen2vl-7b-vision-instruct is a vision-language model for image-based applications, optimized for Thai and English. It is built on Qwen2-VL-7B-Instruct; while the underlying Qwen2-VL architecture can process both images and videos, this variant focuses on image inputs.

Architecture

  • Model Type: A 7-billion-parameter, decoder-only instruct model with a vision encoder, based on the Qwen2 architecture.
  • Languages: Thai 🇹🇭 and English 🇬🇧.
  • Library Requirement: Requires transformers version 4.38.0 or newer (a quick version check is sketched after this list).
  • Demo: Available at Open Typhoon Vision.
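
Older transformers releases do not include Qwen2-VL support, so it can help to confirm the requirement above before loading the model. A minimal sketch, assuming the packaging helper is available (it ships alongside pip in most environments):

    # Verify that the installed transformers meets the 4.38.0 floor noted above.
    from importlib.metadata import version
    from packaging.version import Version
    
    installed = Version(version("transformers"))
    assert installed >= Version("4.38.0"), f"transformers {installed} is too old; need >= 4.38.0"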

Training

The model was trained on a combination of image and text inputs, enabling it to handle vision-language tasks such as OCR and multimodal question answering. Reported evaluations compare its performance against other models on benchmarks for these tasks.

Guide: Running Locally

  1. Install Dependencies:

    pip install torch "transformers>=4.38.0" accelerate pillow requests
    
  2. Setup Model and Processor:

    from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
    
    model_name = "scb10x/typhoon2-qwen2vl-7b-vision-instruct"
    # torch_dtype="auto" uses the checkpoint's native precision;
    # device_map="auto" places the weights on the available GPU(s).
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        model_name, torch_dtype="auto", device_map="auto"
    )
    # The processor bundles the tokenizer and the image preprocessor.
    processor = AutoProcessor.from_pretrained(model_name)
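
If GPU memory is tight, the weights can instead be loaded in 4-bit. This variant is a sketch rather than part of the official guide; it assumes bitsandbytes is installed (pip install bitsandbytes):

    import torch
    from transformers import BitsAndBytesConfig
    
    # 4-bit NF4 quantization roughly quarters the weight memory footprint.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        model_name, quantization_config=bnb_config, device_map="auto"
    )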
    
  3. Process an Image:

    from PIL import Image
    import requests
    
    # Fetch a sample image (a Bangkok street scene).
    url = "https://cdn.pixabay.com/photo/2023/05/16/09/15/bangkok-7997046_1280.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    
    # The Thai prompt asks: "Identify the name of the place and the country in this image, in Thai."
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": "ระบุชื่อสถานที่และประเทศของภาพนี้เป็นภาษาไทย"},
            ],
        }
    ]
    text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
    inputs = processor(text=[text_prompt], images=[image], padding=True, return_tensors="pt").to("cuda")
    output_ids = model.generate(**inputs, max_new_tokens=128)
    # Drop the prompt tokens so only the newly generated answer is decoded.
    generated_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
    output_text = processor.batch_decode(
        generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )
    print(output_text[0])
    
  4. Cloud GPU Suggestion: For best performance, run the model on a GPU instance from a cloud provider such as AWS, Google Cloud, or Azure; a simple device check is sketched below.
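
On a machine without a GPU, the .to("cuda") call in step 3 will fail. A minimal fallback sketch (keeping in mind that CPU inference with a 7B model is very slow):

    import torch
    
    # Move inputs to the GPU when one is present; otherwise fall back to CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    inputs = inputs.to(device)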

License

This project is licensed under the Apache 2.0 License, allowing for both personal and commercial use, modification, and distribution, provided that appropriate credit is given to the original authors.
