Era X V L 2 B V1.5

erax-ai

Introduction

EraX-VL-2B-V1.5 is an advanced multimodal model developed as part of the EraX's LànhGPT collection. It provides powerful Optical Character Recognition (OCR) and Visual Question Answering (VQA) capabilities across various languages, with a strong focus on Vietnamese. The model is designed to recognize details in different document types such as medical forms, invoices, and identity cards, making it suitable for applications in healthcare, insurance, and other industries. It is built on Qwen/Qwen2-VL-2B-Instruct and fine-tuned for improved performance. EraX-VL-2B-V1.5 supports multi-turn Q&A and reasoning with a compact parameter size of over 2 billion.

Architecture

EraX-VL-2B-V1.5 is a multimodal transformer model with over 2 billion parameters. It is fine-tuned from the Qwen/Qwen2-VL-2B-Instruct model and is equipped to handle multiple languages, with a primary focus on Vietnamese. The model integrates the capabilities of Optical Character Recognition and Visual Question Answering, tailored for diverse document types.

Training

The model has been fine-tuned to enhance its OCR and VQA capabilities. However, it has not yet been trained on specific datasets like medical X-rays or car accident data. The development team aims to expand its training data in future versions, expected around 2025.

Guide: Running Locally

To run EraX-VL-2B-V1.5 locally, follow these steps:

  1. Install Required Packages:

    python -m pip install git+https://github.com/huggingface/transformers accelerate
    python -m pip install qwen-vl-utils
    pip install flash-attn --no-build-isolation
    
  2. Load the Model:

    from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
    model_path = "erax/EraX-VL-2B-V1.5"
    model = Qwen2VLForConditionalGeneration.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    
  3. Process Input Data: Use AutoProcessor and qwen_vl_utils to prepare your data input.

  4. Inference: Set up and run the model on your data to generate text outputs.

For optimal performance, consider using cloud GPUs with Ampere architecture, if available.

License

EraX-VL-2B-V1.5 is released under the Apache 2.0 License, allowing for both personal and commercial use with proper attribution.

More Related APIs in Visual Question Answering