Introduction

DePlot is a model for visual language reasoning that translates an image of a plot or chart into a linearized table. The table can then be passed as plain text to a pretrained large language model (LLM), which does the reasoning. According to the DePlot paper, DePlot combined with an LLM and one-shot prompting improves on the previous fine-tuned state of the art by 24.0% on human-written chart QA queries.

Architecture

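The approach splits visual language reasoning into two steps: (1) translating the plot to text, and (2) reasoning over the translated text. DePlot is the modality conversion module that handles the first step, turning a chart image into a linearized table. A pretrained LLM handles the second step. Because the LLM can be prompted with only a few examples, the pipeline needs far fewer task-specific training examples than models fine-tuned directly on chart question answering.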
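
The second step can be sketched as plain string handling. This is a minimal illustration, not the prompt format used by the DePlot authors: the function name `build_reasoning_prompt` and the prompt wording are hypothetical, and the code assumes that the decoded table uses `<0x0A>` as a row separator and `|` as a cell separator.

```python
def build_reasoning_prompt(linearized_table: str, question: str) -> str:
    """Combine a DePlot-style table and a question into one LLM prompt.

    Hypothetical helper: the prompt wording is illustrative only.
    """
    # Assumption: "<0x0A>" marks a row break in the decoded model output.
    text = linearized_table.replace("<0x0A>", "\n")
    # Collapse stray whitespace inside each row and drop empty rows.
    rows = [" ".join(line.split()) for line in text.splitlines()]
    table = "\n".join(row for row in rows if row)
    return (
        "Read the table below and answer the question.\n\n"
        f"{table}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up table text standing in for real DePlot output.
example_table = "Year | Sales <0x0A> 2020 | 10 <0x0A> 2021 | 15"
prompt = build_reasoning_prompt(example_table, "In which year were sales highest?")
print(prompt)
```

The resulting string can be sent to any text-only LLM, which is what lets the reasoning step work without access to the original image.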

Training

DePlot is trained end-to-end on a standardized plot-to-table task. Standardizing the task means establishing unified task formats and metrics, so that the translation from chart image to table can be evaluated consistently and its text output can be processed by an LLM.

Guide: Running Locally

To run DePlot locally, follow these steps:

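  1. Install Necessary Libraries: Ensure you have transformers, requests, and Pillow (imported as PIL) installed. PyTorch is also required, because the example below requests PyTorch tensors with return_tensors="pt".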

  2. Load the Model and Processor:

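    # DePlot is built on the Pix2Struct architecture, so it loads through the Pix2Struct classes.
    from transformers import Pix2StructProcessor, Pix2StructForConditionalGeneration
    # The first call downloads the checkpoint from the Hugging Face Hub.
    processor = Pix2StructProcessor.from_pretrained('google/deplot')
    model = Pix2StructForConditionalGeneration.from_pretrained('google/deplot')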
    
  3. Prepare the Input Image:

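    import requests
    from PIL import Image
    # Example chart image from the ChartQA validation set.
    url = "https://raw.githubusercontent.com/vis-nlp/ChartQA/main/ChartQA%20Dataset/val/png/5090.png"
    # Stream the response and let PIL read the image from the raw byte stream.
    image = Image.open(requests.get(url, stream=True).raw)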
    
  4. Generate Predictions:

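    # The text prompt tells the model to produce the chart's underlying data table.
    inputs = processor(images=image, text="Generate underlying data table of the figure below:", return_tensors="pt")
    # max_new_tokens caps the length of the generated table.
    predictions = model.generate(**inputs, max_new_tokens=512)
    # Decode the token IDs of the first (and only) prediction into text.
    print(processor.decode(predictions[0], skip_special_tokens=True))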
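
The decoded output is a single string, so it usually needs to be split into rows before further use. The sketch below is one way to do that, under the assumption that rows are separated by `<0x0A>` and cells by `|`. The helper name `parse_linearized_table` is hypothetical, and the sample string is made up for illustration rather than taken from a real model run.

```python
from typing import List


def parse_linearized_table(text: str) -> List[List[str]]:
    """Split a linearized table string into a list of rows of cells."""
    rows = []
    # Assumption: "<0x0A>" is the row separator in the decoded output.
    for line in text.replace("<0x0A>", "\n").splitlines():
        # Assumption: "|" separates the cells within a row.
        cells = [cell.strip() for cell in line.split("|")]
        # Skip rows that contain no text at all.
        if any(cells):
            rows.append(cells)
    return rows


# Made-up sample standing in for processor.decode(...) output.
sample = "TITLE | Demo chart <0x0A> Year | Sales <0x0A> 2020 | 10 <0x0A> 2021 | 15"
table_rows = parse_linearized_table(sample)
for row in table_rows:
    print(row)
```

Each row is a list of strings, so numeric cells still need to be converted before any arithmetic is done on them.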
    

For faster inference, a GPU is recommended. If no local GPU is available, cloud GPU instances such as those offered by AWS EC2 or Google Cloud are an option.

License

DePlot is released under the Apache-2.0 license, which allows for broad use and distribution with appropriate attribution.
