Qwen2-VL-OCR-2B-Instruct
by prithivMLmods

Introduction
The Qwen2-VL-OCR-2B-Instruct model is a fine-tuned version of Qwen/Qwen2-VL-2B-Instruct, specifically designed for Optical Character Recognition (OCR), image-to-text conversion, and solving math problems with LaTeX formatting. It integrates conversational capabilities with visual and textual understanding to effectively manage multi-modal tasks.
Architecture
The model achieves state-of-the-art performance on visual understanding benchmarks and can understand videos longer than 20 minutes. It is capable of complex reasoning and decision-making, enabling integration with devices such as mobile phones and robots. Furthermore, it supports multiple languages, including English, Chinese, and various European and Asian languages.
Training
- Base Model: Qwen/Qwen2-VL-2B-Instruct
- Model Size: 2.21 billion parameters
- Optimizations: BF16 tensor type for efficient inference
- Specializations: OCR tasks, mathematical reasoning, and LaTeX output
Guide: Running Locally
- Set Up Environment:
  - Install the transformers library.
  - Set up the qwen_vl_utils package for vision processing.
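The environment setup above can be done with pip; a minimal sketch, assuming the qwen_vl_utils helper is the package published on PyPI as qwen-vl-utils, and that a PyTorch backend is needed for inference:

```shell
# Transformers library (Qwen2-VL support requires a recent version)
pip install -U transformers
# Vision pre-processing helpers used in Qwen2-VL examples
pip install qwen-vl-utils
# PyTorch backend -- pick the build matching your CUDA version if using a GPU
pip install torch torchvision
```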
- Loading the Model:

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Qwen2-VL-OCR-2B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("prithivMLmods/Qwen2-VL-OCR-2B-Instruct")
```
- Processing Input:
  - Prepare image and text inputs.
  - Use the processor to apply the chat template and tokenize.
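To make the input format concrete, the sketch below builds the chat-style message payload that Qwen2-VL processors expect; the image URL is a placeholder, and the commented processor calls show how the loaded processor would consume this structure:

```python
# Chat-style payload combining one image and one text instruction.
# The "image" field can be a URL, a local file path, or a PIL.Image object.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/receipt.png"},  # placeholder URL
            {"type": "text", "text": "Extract all text from this image."},
        ],
    }
]

# With the processor loaded in the previous step, the prompt and tensors
# would then be built roughly as:
#   text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
#   inputs = processor(text=[text], images=[...], padding=True, return_tensors="pt")
print(messages[0]["content"][1]["text"])
```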
- Inference:
  - Generate output with the model.
  - Decode and print the output text.
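A common idiom when decoding is to trim the prompt tokens from the generated sequence, since model.generate echoes the input back. A minimal sketch with plain lists standing in for the tensors (real code would use inputs.input_ids, the output of model.generate, and processor.batch_decode):

```python
# Toy stand-ins: the prompt token ids fed to model.generate, and the
# full generated ids (prompt echoed back, followed by new tokens).
input_ids = [[101, 102, 103]]
generated_ids = [[101, 102, 103, 7, 8, 9]]

# Keep only the newly generated tokens for each sequence in the batch;
# these are what processor.batch_decode would turn into output text.
trimmed = [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]
print(trimmed)  # → [[7, 8, 9]]
```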
- Cloud GPUs:
  - Consider using cloud GPU services such as AWS, Google Cloud, or Azure for better performance.
License
The model is licensed under the Apache 2.0 License, which allows for free use, distribution, and modification under specified conditions.