tapex large finetuned tabfact

microsoft

Introduction

TAPEX is a model designed for table pre-training, enhancing models with table reasoning skills. It leverages a neural SQL executor to process tables, empowering it for tasks such as table fact verification. TAPEX is based on the BART architecture, integrating a bidirectional encoder and an autoregressive decoder.

Architecture

TAPEX utilizes the BART architecture, which is a transformer-based seq2seq model. It features a bidirectional encoder similar to BERT and an autoregressive decoder akin to GPT. This combination allows TAPEX to perform table reasoning effectively by pre-training with a neural SQL executor on a synthetic dataset of executable SQL queries.

Training

The TAPEX model is fine-tuned on the Tabfact dataset, which is specifically designed for table fact verification tasks. The model is trained to understand and execute SQL queries synthesized for table reasoning, allowing it to verify facts against tables efficiently.

Guide: Running Locally

To use TAPEX locally, follow these steps:

  1. Install Dependencies: Ensure that you have Python and PyTorch installed. Install the transformers library from Hugging Face using pip:

    pip install transformers
    
  2. Load the Model and Tokenizer: Use the following code snippet to load the TAPEX tokenizer and model:

    from transformers import TapexTokenizer, BartForSequenceClassification
    import pandas as pd
    
    tokenizer = TapexTokenizer.from_pretrained("microsoft/tapex-large-finetuned-tabfact")
    model = BartForSequenceClassification.from_pretrained("microsoft/tapex-large-finetuned-tabfact")
    
  3. Prepare Your Data: Structure your data in a pandas DataFrame and define your query:

    data = {
        "year": [1896, 1900, 1904, 2004, 2008, 2012],
        "city": ["athens", "paris", "st. louis", "athens", "beijing", "london"]
    }
    table = pd.DataFrame.from_dict(data)
    query = "beijing hosts the olympic games in 2012"
    
  4. Run the Model: Encode the table and query, then pass them through the model to obtain results:

    encoding = tokenizer(table=table, query=query, return_tensors="pt")
    outputs = model(**encoding)
    output_id = int(outputs.logits[0].argmax(dim=0))
    print(model.config.id2label[output_id])
    

For enhanced performance, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.

License

The TAPEX model is released under the MIT License, which permits use, distribution, and modification.

More Related APIs in Table Question Answering