Vintern-1B-v2
5CD-AI
Introduction
Vintern-1B-v2 is a Vietnamese multimodal model built for tasks such as Optical Character Recognition (OCR), text recognition, and Visual Question Answering (VQA). By pairing a strong Vietnamese language model with a capable vision encoder, it runs efficiently at roughly 1 billion parameters, making it suitable for a range of on-device applications.
Architecture
The Vintern-1B-v2 model combines the InternViT-300M-448px vision encoder with the Qwen2-0.5B-Instruct language model. The resulting multimodal model is optimized for joint visual and linguistic tasks and supports both Vietnamese and English.
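The fusion step in encoder-plus-LLM designs like this can be illustrated with a toy sketch: patch embeddings from the vision tower are mapped through a linear projector into the language model's embedding space, then concatenated with the text-token embeddings before the LLM runs. The dimensions and projector below are made-up toy values, not the model's real configuration.

```python
# Toy sketch of InternVL-style multimodal fusion. VISION_DIM and LLM_DIM are
# tiny stand-ins; the real encoder and LLM use hidden sizes in the hundreds.
import random

VISION_DIM, LLM_DIM = 8, 6  # hypothetical toy dimensions

def linear(x, w):
    """Multiply vector x (len VISION_DIM) by a VISION_DIM x LLM_DIM matrix w."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*w)]

random.seed(0)
# Projector mapping vision-encoder space into the LLM's embedding space.
projector = [[random.gauss(0, 0.02) for _ in range(LLM_DIM)]
             for _ in range(VISION_DIM)]

# 4 patch tokens from the vision encoder, 3 text tokens from the tokenizer.
patch_embeddings = [[random.gauss(0, 1) for _ in range(VISION_DIM)]
                    for _ in range(4)]
text_embeddings = [[0.0] * LLM_DIM for _ in range(3)]

image_tokens = [linear(p, projector) for p in patch_embeddings]
fused_sequence = image_tokens + text_embeddings  # what the LLM consumes

print(len(fused_sequence), len(fused_sequence[0]))  # prints: 7 6
```

After projection, image and text tokens share one embedding width, so the language model processes them as a single sequence.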
Training
The model is fine-tuned on over 3 million image-question-answer pairs drawn from datasets such as Viet-OCR-VQA and Viet-Doc-VQA. These datasets cover varied document types, handwritten text, and computer science content, strengthening the model's OCR and VQA capabilities.
Guide: Running Locally
To run Vintern-1B-v2 locally:
- Setup Environment: Ensure Python and the necessary libraries are installed. Use `pip` to install `torch`, `transformers`, and `torchvision`.
- Load Model: Use the provided code snippet to load the tokenizer and model.
- Process Images: Preprocess images using the provided `dynamic_preprocess` function to fit the model's input requirements.
- Infer: Use the model to generate answers from input images and questions.
- GPU Recommendation: For optimal performance, use a cloud GPU service like Google Colab with a T4 GPU.
License
The Vintern-1B-v2 model is released under the MIT License, allowing flexible use and distribution.