Vintern-1B-v2

5CD-AI

Introduction

Vintern-1B-v2 is a Vietnamese multimodal model designed for tasks such as Optical Character Recognition (OCR) and Visual Question Answering (VQA). By pairing a compact Vietnamese language model with a vision encoder, it runs efficiently at roughly 1 billion parameters, making it suitable for a variety of on-device applications.

Architecture

The Vintern-1B-v2 model combines the InternViT-300M-448px vision encoder with the Qwen2-0.5B-Instruct language model. The resulting multimodal model is optimized for joint visual and linguistic tasks and supports both Vietnamese and English.
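As a back-of-envelope sanity check, the nominal sizes embedded in the two component names already account for most of the advertised parameter count (these are the names' nominal figures, not exact counts; the vision-language projector adds a small remainder):

```python
# Rough parameter budget from the component names alone
# (nominal figures, not exact counts; the projector adds a bit more).
vision = 300e6     # InternViT-300M-448px vision encoder
language = 0.5e9   # Qwen2-0.5B-Instruct language model
total = vision + language
print(f"~{total / 1e9:.1f}B parameters")  # nominal components sum to 0.8B
```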

Training

The model is fine-tuned on over 3 million image-question-answer pairs drawn from datasets such as Viet-OCR-VQA and Viet-Doc-VQA. These datasets span a variety of document types, handwritten text, and computer-science content, strengthening the model's OCR and VQA capabilities.

Guide: Running Locally

To run Vintern-1B-v2 locally:

  1. Set Up the Environment: Ensure Python and the necessary libraries are installed. Use pip to install torch, transformers, and torchvision.
  2. Load Model: Use the provided code snippet to load the tokenizer and model.
  3. Process Images: Preprocess images using the provided dynamic_preprocess function to fit the model's input requirements.
  4. Infer: Use the model to generate answers from input images and questions.
  5. GPU Recommendation: For optimal performance, use a cloud GPU service like Google Colab with a T4 GPU.
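The tiling step behind the dynamic_preprocess function mentioned in step 3 can be sketched as follows. The model card's actual helper is more elaborate; this simplified version (the function name and the exact ratio-selection heuristic here are assumptions) picks the tile grid whose aspect ratio best matches the input image, so that each 448x448 tile feeds InternViT-300M-448px with minimal distortion:

```python
# Simplified sketch of dynamic tiling: choose a (cols, rows) grid of
# 448x448 tiles whose overall aspect ratio is closest to the image's,
# subject to a cap on the total number of tiles.

def find_best_grid(width: int, height: int, max_tiles: int = 6) -> tuple[int, int]:
    """Return (cols, rows) with cols * rows <= max_tiles whose aspect
    ratio cols/rows is closest to the image's width/height ratio."""
    aspect = width / height
    candidates = [
        (c, r)
        for c in range(1, max_tiles + 1)
        for r in range(1, max_tiles + 1)
        if c * r <= max_tiles
    ]
    return min(candidates, key=lambda cr: abs(cr[0] / cr[1] - aspect))

# A wide page maps to side-by-side tiles, a tall one to stacked tiles:
print(find_best_grid(896, 448))   # → (2, 1)
print(find_best_grid(448, 1344))  # → (1, 3)
```

The image would then be resized to `(cols * 448, rows * 448)` and cropped into tiles, each encoded separately by the vision tower.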

License

The Vintern-1B-v2 model is released under the MIT License, permitting flexible use, modification, and distribution.
