Vinqw-1B Model Documentation

Introduction

Vinqw-1B is a 1-billion-parameter Vision-Language Model (VLM) that combines the InternViT-300M-448px vision encoder with the Qwen2.5-0.5B-Instruct language model. It features a novel image pre-processing method and is trained on a curated dataset created using Gemini.

Architecture

The model follows the "ViT-MLP-LLM" architecture paradigm. At the pre-processing stage, it introduces a distinctive image-handling step: images are padded and resized to a consistent aspect ratio before being passed to the InternViT-300M-448px encoder. This improves text detection and preserves context by reducing the chance that text is truncated.
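
The padding-and-resizing idea can be pictured with the following minimal sketch. It is an illustration only: the exact preprocessing pipeline of Vinqw-1B is not reproduced here, the 448x448 target simply matches the InternViT-300M-448px input resolution, and the helper name pad_and_resize is hypothetical.

    # Minimal sketch of padding an image to a square canvas and resizing it to
    # the vision encoder's input resolution. Padding instead of cropping keeps
    # the whole page visible, so text near the borders is not cut off.
    from PIL import Image

    def pad_and_resize(image, target_size=448, fill=(255, 255, 255)):
        image = image.convert("RGB")
        width, height = image.size
        side = max(width, height)

        # Center the original image on a square canvas filled with `fill`.
        canvas = Image.new("RGB", (side, side), fill)
        canvas.paste(image, ((side - width) // 2, (side - height) // 2))

        # Resize the square canvas to the encoder's expected resolution.
        return canvas.resize((target_size, target_size), Image.Resampling.BICUBIC)

    processed = pad_and_resize(Image.open("document.jpg"))
    processed.save("document_448.jpg")

Because the image is padded rather than cropped or stretched, its aspect ratio is preserved and the surrounding context of any text stays intact.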

Training

Vinqw-1B was trained in two stages. Pre-training used datasets such as Viet-OCR-VQA, Viet-Doc-VQA, Viet-Doc-VQA-II, and Vista; fine-tuning used the TD-ViOCR-CPVQA dataset. Compared with the DHR method, the model's image-processing approach yielded a 0.2-point improvement in CIDEr.

Benchmarks

The model performed best with the PCOHR method, reaching a CIDEr score of 5.3619, and achieved competitive Gemini-based evaluation scores when compared with models such as Vintern-1B-V2 and EraX-VL-7B-V1.0.

Guide: Running Locally

  1. Clone the repository from Hugging Face.
  2. Install required dependencies using pip:
    pip install torch transformers safetensors
    
  3. Load the model using the Transformers library in Python (a minimal loading sketch follows this list).
  4. Execute the model on your dataset.
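
For steps 3 and 4, a minimal loading sketch is shown below, assuming the checkpoint is published on the Hugging Face Hub and ships its own modelling code via trust_remote_code. The repository id and dtype choice are assumptions; follow the model card for the exact inference interface.

    # Minimal loading sketch for steps 3-4. "antphb/Vinqw-1B-v1" is a
    # hypothetical repository id; replace it with the id from the model card.
    import torch
    from transformers import AutoModel, AutoTokenizer

    MODEL_ID = "antphb/Vinqw-1B-v1"  # hypothetical repo id

    model = AutoModel.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,   # half precision keeps the 1B model small
        trust_remote_code=True,       # load the custom VLM modelling code
    ).eval()
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

    if torch.cuda.is_available():
        model = model.cuda()

    # Inference (step 4) then follows the chat/generate interface documented
    # on the model card, which differs between VLM implementations.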

For optimal performance, especially with large datasets, consider using a cloud GPU service such as AWS, Google Cloud, or Azure.

License

The usage and distribution of Vinqw-1B are subject to the licensing terms provided by its authors. Please refer to the respective repositories for detailed license information.
