matcha-chart2text-pew (google)
Introduction
The MATCHA model is fine-tuned on the Pew split of the Chart2Text dataset and is designed for chart summarization. MATCHA enhances visual language models by adding math reasoning and chart derendering pretraining objectives, and it shows significant improvements on benchmarks such as PlotQA and ChartQA.
Architecture
MATCHA builds upon Pix2Struct, a visual language model that converts images to text. It introduces several pretraining tasks focused on plot deconstruction and numerical reasoning, which are key capabilities for visual language modeling. The model aims to close the gap in understanding visual language data such as charts and infographics.
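Because MATCHA reuses the Pix2Struct architecture, the checkpoint is exposed through the Pix2Struct classes in the Transformers library. A minimal sketch (assuming Transformers is installed) that loads the configuration to confirm the image-encoder/text-decoder structure:

from transformers import Pix2StructConfig

# MATCHA checkpoints ship a Pix2Struct configuration: a vision encoder
# (vision_config) paired with a text decoder (text_config).
config = Pix2StructConfig.from_pretrained("google/matcha-chart2text-pew")
print(type(config.vision_config).__name__)  # Pix2StructVisionConfig
print(type(config.text_config).__name__)    # Pix2StructTextConfig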
Training
MATCHA pretraining starts from a Pix2Struct checkpoint and focuses on enhancing the model's ability to jointly model charts/plots and language data. The pretraining tasks are designed to improve plot deconstruction (derendering a chart back into its underlying data) and numerical reasoning. This approach has demonstrated improvements across domains such as screenshots, textbook diagrams, and document figures.
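To make the two objectives concrete, here is a purely illustrative sketch of what pretraining pairs could look like; the field names and target serialization below are assumptions for illustration, not the exact format used in the MATCHA paper:

# Hypothetical chart-derendering pair: the encoder sees a rendered chart
# and the decoder must reproduce the underlying data table as text.
derendering_pair = {
    "image": "bar_chart.png",                        # chart rendered from a table
    "target": "year | value\n2019 | 41\n2020 | 57",  # table serialized as text
}

# Hypothetical math-reasoning pair: a textually rendered problem is given
# as an image, and the decoder must produce the answer.
math_pair = {
    "image": "render_as_image('What is 41 + 57?')",
    "target": "98",
}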
Guide: Running Locally
To run the MATCHA model locally, follow these steps:
- Install Dependencies: Ensure you have the Hugging Face Transformers library installed, along with requests and Pillow, which the example below uses to fetch and open the chart image.
pip install transformers requests pillow
- Load the Model:
from transformers import Pix2StructProcessor, Pix2StructForConditionalGeneration

processor = Pix2StructProcessor.from_pretrained('google/matcha-chart2text-pew')
model = Pix2StructForConditionalGeneration.from_pretrained('google/matcha-chart2text-pew')
- Prepare Input:
import requests
from PIL import Image

url = "https://raw.githubusercontent.com/vis-nlp/ChartQA/main/ChartQA%20Dataset/val/png/20294671002019.png"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")
- Generate Predictions:
predictions = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(predictions[0], skip_special_tokens=True))
For optimal performance, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.
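If a GPU is available, the model and inputs can be moved onto it before generation. A minimal sketch, assuming PyTorch with CUDA support is installed:

import torch

# Use a CUDA device when available; fall back to CPU otherwise.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# The processor returns a BatchFeature whose tensors can be moved as a group.
inputs = processor(images=image, return_tensors="pt").to(device)

predictions = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(predictions[0], skip_special_tokens=True))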
License
The MATCHA model is released under the Apache 2.0 license. This allows for both commercial and non-commercial use, distribution, and modification with appropriate attribution.