Qwen2-VL-Math-Prase-2B-Instruct

prithivMLmods

Introduction

The Qwen2-VL-Math-Prase-2B-Instruct model is a fine-tuned version of Qwen/Qwen2-VL-2B-Instruct, optimized for Optical Character Recognition (OCR), image-to-text conversion, and solving math problems using LaTeX formatting. This model excels in multi-modal tasks, integrating conversational approaches with visual and textual understanding.

Architecture

  • Vision-Language Integration: Combines image understanding with natural language processing to convert images into text.
  • Optical Character Recognition (OCR): Extracts and processes textual information from images with high accuracy.
  • Math and LaTeX Support: Solves mathematical problems and outputs equations in LaTeX format.
  • Conversational Capabilities: Handles multi-turn interactions for context-aware responses.
  • Image-Text-to-Text Generation: Generates descriptive or problem-solving text from images and text inputs.
  • Secure Weight Format: Utilizes Safetensors for secure and efficient model weight loading.

Training

  • Base Model: Qwen/Qwen2-VL-2B-Instruct
  • Model Size: 2.21 billion parameters, stored in BF16 for efficient inference.
  • Specializations: Tailored for OCR tasks in images and mathematical reasoning with LaTeX output.

Guide: Running Locally

  1. Install Required Packages:

    pip install transformers accelerate qwen-vl-utils
    
  2. Load the Model:

    from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        "prithivMLmods/Qwen2-VL-Math-Prase-2B-Instruct", torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained("prithivMLmods/Qwen2-VL-Math-Prase-2B-Instruct")
    
  3. Prepare Input:

    messages = [{"role": "user", "content": [{"type": "image", "image": "image_url"}, {"type": "text", "text": "Describe this image."}]}]
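A sketch of how this payload can be built programmatically, using a hypothetical helper `build_message` (not part of the model's API). The `"image"` field accepts a URL, a local file path, or a PIL image; it is resolved later by `process_vision_info` from qwen-vl-utils:

```python
# Hypothetical helper (for illustration only): builds the Qwen2-VL chat
# message structure for one image plus one text question.
def build_message(image, question):
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},  # URL, local path, or PIL.Image
                {"type": "text", "text": question},
            ],
        }
    ]

# Example: ask for a LaTeX-formatted solution of an equation in an image.
messages = build_message(
    "image_url",  # placeholder, as in the snippet above
    "Solve the equation in this image and answer in LaTeX.",
)
```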
    
  4. Inference:

    from qwen_vl_utils import process_vision_info

    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to("cuda")
    generated_ids = model.generate(**inputs, max_new_tokens=128)
    # Strip the prompt tokens so only the newly generated answer is decoded.
    generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
    output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)
    print(output_text)
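Note on `generated_ids_trimmed`: `model.generate` returns the prompt tokens followed by the new tokens, so each output sequence must be sliced past the corresponding input length before decoding. The logic, sketched with plain lists instead of tensors:

```python
# Two prompts of different lengths (token IDs are made up for illustration).
input_ids = [[101, 7592, 102], [101, 2054, 2003, 102]]
# generate() echoes each prompt, then appends the new answer tokens.
generated_ids = [[101, 7592, 102, 9, 8], [101, 2054, 2003, 102, 7]]
# Slice off the prompt so only the answer tokens remain.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(input_ids, generated_ids)
]
print(generated_ids_trimmed)  # [[9, 8], [7]]
```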
    
  5. Cloud GPUs: For optimal performance, consider using cloud GPU services such as AWS EC2, Google Cloud, or Azure.

License

The model is released under the Apache-2.0 license.
