TableGPT2-7B
Introduction
TableGPT2-7B is a large-scale decoder model developed at Zhejiang University to interpret and analyze tabular data. Part of the Qwen2.5 family, it bridges the gap between general-purpose language models and the demands of structured-data tasks such as business intelligence and data-driven analysis.
Architecture
TableGPT2-7B is based on the Qwen2.5 architecture, extended for structured, tabular data. It introduces a dedicated semantic table encoder that captures information from rows, columns, and entire tables. The model was trained with Continual Pretraining (CPT) and Supervised Fine-Tuning (SFT) to strengthen performance on real-world BI applications and complex query processing. Currently, the model is released as a standalone decoder, with plans for tighter integration with DeepSpeed and vLLM.
Training
TableGPT2-7B was trained on over 593,800 curated tables and more than 86 billion tokens. Following CPT, it was fine-tuned on over 2.36 million high-quality query-table-output tuples, a table-focused corpus intended to cover the demands of modern structured-data applications. The training data has a cutoff of October 2024.
Guide: Running Locally
- Install Dependencies: Ensure you have transformers>=4.37.0 installed (quote the requirement so the shell does not interpret ">="):

      pip install "transformers>=4.37.0"
- Load the Model: Use the transformers library to load the model and tokenizer:

      from transformers import AutoModelForCausalLM, AutoTokenizer

      model_name = "tablegpt/TableGPT2-7B"
      model = AutoModelForCausalLM.from_pretrained(
          model_name, torch_dtype="auto", device_map="auto"
      )
      tokenizer = AutoTokenizer.from_pretrained(model_name)
- Prepare Input Data: Use pandas to structure your tabular data:

      import pandas as pd
      from io import StringIO

      EXAMPLE_CSV_CONTENT = """..."""
      csv_file = StringIO(EXAMPLE_CSV_CONTENT)
      df = pd.read_csv(csv_file)
- Generate Output: Fill the prompt template with a preview of the table and your question, then pass the resulting prompt to the model (a consolidated, runnable sketch follows this list):

      example_prompt_template = """..."""
      question = "..."
      prompt = example_prompt_template.format(
          var_name="df",
          df_info=df.head(5).to_string(index=False),
          user_question=question,
      )
- Deployment: Use vLLM to serve the model behind an OpenAI-compatible API (an example client request appears after the cloud GPU note below):

      pip install "vllm>=0.5.5"
      python -m vllm.entrypoints.openai.api_server \
          --served-model-name TableGPT2-7B \
          --model path/to/weights
Cloud GPUs: To enhance performance, consider using cloud-based GPUs for model training and inference.
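Once the vLLM server from the deployment step is running (locally or on a cloud GPU), it can be queried with any OpenAI-compatible client. A minimal sketch, assuming the server listens on the default address http://localhost:8000, no API key is enforced, and the openai Python package is installed; these are deployment-specific assumptions, not requirements from the model card:

    # Query the OpenAI-compatible endpoint exposed by vLLM.
    # Assumes: server at http://localhost:8000/v1, no real API key required.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model="TableGPT2-7B",  # must match --served-model-name
        messages=[
            {
                "role": "user",
                "content": "Summarize the columns in a table with fields: name, score.",
            }
        ],
        max_tokens=256,
    )
    print(response.choices[0].message.content)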
License
TableGPT2-7B is licensed under the Apache-2.0 license, allowing for broad use and distribution with appropriate attribution.