EraX-VL-7B-V1.5

Introduction

EraX-VL-7B-V1.5 is a robust multimodal model developed for Optical Character Recognition (OCR) and Visual Question Answering (VQA), with multilingual capabilities focused on Vietnamese. It is designed to handle a variety of document types, such as medical forms, invoices, and legal documents, making it useful in sectors like healthcare and insurance. The model is based on Qwen/Qwen2-VL-7B-Instruct and has over 7 billion parameters. It is part of the LànhGPT collection and was developed by a team at EraX, funded by Bamboo Capital Group.

Architecture

EraX-VL-7B-V1.5 is a Multimodal Transformer model, fine-tuned from Qwen/Qwen2-VL-7B-Instruct. It supports multiple languages, primarily Vietnamese, and has been enhanced for precise document recognition and multi-turn Q&A with robust reasoning capabilities.

Training

The model was fine-tuned on a diverse dataset to strengthen its OCR and VQA capabilities. It has not yet been trained on medical or car-accident datasets; updates covering these domains are expected by early 2025. The benchmark results are open-source and can be independently re-evaluated.

Guide: Running Locally

To run EraX-VL-7B-V1.5 locally:

  1. Install the necessary packages:

    python -m pip install git+https://github.com/huggingface/transformers accelerate
    python -m pip install qwen-vl-utils
    python -m pip install flash-attn --no-build-isolation  # optional; only needed for attn_implementation="flash_attention_2"
    
  2. Load the model in Python:

    import torch
    from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
    
    model_path = "erax-ai/EraX-VL-7B-V1.5"
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,
        attn_implementation="eager",  # or "flash_attention_2" on Ampere or newer GPUs (requires flash-attn)
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    processor = AutoProcessor.from_pretrained(model_path)
    
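
Optionally, the number of visual tokens per image can be bounded via the processor's min_pixels / max_pixels arguments, a standard Qwen2-VL processor option; the values below are illustrative, not prescribed by the model card:

    # Optional: bound the visual token budget per image to control memory use
    min_pixels = 256 * 28 * 28    # illustrative lower bound
    max_pixels = 1280 * 28 * 28   # illustrative upper bound
    processor = AutoProcessor.from_pretrained(
        model_path, min_pixels=min_pixels, max_pixels=max_pixels
    )
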
  3. Prepare your images and text prompts, then run inference; a minimal example is sketched below.
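
A minimal sketch of step 3, following the standard Qwen2-VL inference pattern; the image path and the Vietnamese OCR prompt are illustrative placeholders:

    from qwen_vl_utils import process_vision_info

    # Build a multimodal chat message: one image plus a text instruction
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "path/to/document.jpg"},  # placeholder path
                {"type": "text", "text": "Trích xuất toàn bộ nội dung của tài liệu này."},  # "Extract all content of this document."
            ],
        }
    ]

    # Render the chat template and gather the vision inputs
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    ).to(model.device)

    # Generate, then strip the prompt tokens before decoding
    generated_ids = model.generate(**inputs, max_new_tokens=1024)
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
    print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])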

Consider using cloud GPUs, such as those offered by AWS or Google Cloud, if local hardware is limited, especially for batch processing or long multi-turn sessions.

License

This model is released under the Apache 2.0 License, allowing for free use and distribution with proper attribution.
