Qwen2-VL-OCR-2B-Instruct
by prithivMLmods

Introduction
The Qwen2-VL-OCR-2B-Instruct model is a fine-tuned version of Qwen/Qwen2-VL-2B-Instruct, specifically designed for Optical Character Recognition (OCR), image-to-text conversion, and solving math problems with LaTeX formatting. It integrates conversational capabilities with visual and textual understanding to effectively manage multi-modal tasks.
Architecture
The model achieves state-of-the-art performance on visual understanding benchmarks and can understand videos longer than 20 minutes. It is capable of complex reasoning and decision-making, enabling integration with devices such as mobile phones and robots. Furthermore, it supports multiple languages, including English, Chinese, and various European and Asian languages.
Training
- Base Model: Qwen/Qwen2-VL-2B-Instruct
- Model Size: 2.21 billion parameters
- Optimizations: BF16 tensor type for efficient inference
- Specializations: OCR tasks, mathematical reasoning, and LaTeX output
Guide: Running Locally
- Set Up Environment:
  - Install the transformers library.
  - Set up the qwen_vl_utils package for vision processing.
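The environment setup above can be done with pip; a minimal sketch, assuming the qwen_vl_utils helper is the package published on PyPI as qwen-vl-utils, and that a PyTorch backend is needed for inference:

```shell
# Transformers library (Qwen2-VL support requires a recent version)
pip install -U transformers
# Vision pre-processing helpers used in Qwen2-VL examples
pip install qwen-vl-utils
# PyTorch backend -- pick the build matching your CUDA version if using a GPU
pip install torch torchvision
```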
- Loading the Model:

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Qwen2-VL-OCR-2B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("prithivMLmods/Qwen2-VL-OCR-2B-Instruct")
```
- Processing Input:
  - Prepare image and text inputs.
  - Use the processor to apply the chat template and tokenize.
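To make the input format concrete, the sketch below builds the chat-style message payload that Qwen2-VL processors expect; the image URL is a placeholder, and the commented processor calls show how the loaded processor would consume this structure:

```python
# Chat-style payload combining one image and one text instruction.
# The "image" field can be a URL, a local file path, or a PIL.Image object.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/receipt.png"},  # placeholder URL
            {"type": "text", "text": "Extract all text from this image."},
        ],
    }
]

# With the processor loaded in the previous step, the prompt and tensors
# would then be built roughly as:
#   text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
#   inputs = processor(text=[text], images=[...], padding=True, return_tensors="pt")
print(messages[0]["content"][1]["text"])
```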
- Inference:
  - Generate output with the model.
  - Decode and print the output text.
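A common idiom when decoding is to trim the prompt tokens from the generated sequence, since model.generate echoes the input back. A minimal sketch with plain lists standing in for the tensors (real code would use inputs.input_ids, the output of model.generate, and processor.batch_decode):

```python
# Toy stand-ins: the prompt token ids fed to model.generate, and the
# full generated ids (prompt echoed back, followed by new tokens).
input_ids = [[101, 102, 103]]
generated_ids = [[101, 102, 103, 7, 8, 9]]

# Keep only the newly generated tokens for each sequence in the batch;
# these are what processor.batch_decode would turn into output text.
trimmed = [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]
print(trimmed)  # → [[7, 8, 9]]
```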
- Cloud GPUs:
  - Consider using cloud GPU services such as AWS, Google Cloud, or Azure for better performance.
License
The model is licensed under the Apache 2.0 License, which allows for free use, distribution, and modification under specified conditions.