Data science tasks involving tabular data present complex challenges that require sophisticated problem-solving approaches. We propose AutoKaggle, a powerful and user-centric framework that assists data scientists in completing daily data science pipelines through a collaborative multi-agent system. AutoKaggle implements an iterative development process that combines code execution, debugging, and comprehensive unit testing to ensure code correctness and logical consistency. The framework offers highly customizable workflows, allowing users to intervene at each phase and thus integrating automated intelligence with human expertise. Our universal data science toolkit, comprising validated functions for data cleaning, feature engineering, and modeling, forms the foundation of this solution, enhancing productivity by streamlining common tasks. We selected eight Kaggle competitions to simulate real-world data processing workflows. Evaluation results show that AutoKaggle achieves a valid submission rate of 0.85 and a comprehensive score of 0.82 on typical data science pipelines, demonstrating its effectiveness and practicality in handling complex data science tasks.
Despite the rapid advancement of LLM-based agents in data science automation, existing solutions often struggle with complex, multi-step data science tasks and lack transparency in their decision-making processes. We propose AutoKaggle, a universal multi-agent framework that provides end-to-end processing solutions for tabular data through several key innovations: an iterative development process that combines code execution, debugging, and unit testing; a phase-based, customizable workflow that users can inspect and intervene in at every stage; and a universal data science toolkit of validated functions for data cleaning, feature engineering, and modeling.
Figure 2: Development based on iterative debugging and testing.
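To make the cycle in the figure concrete, here is a minimal, self-contained Python sketch of a develop-debug-test loop of this kind. The names (`ExecResult`, `develop`, and the injected callables) are our own illustration, not AutoKaggle's actual API:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ExecResult:
    error: str | None = None                     # runtime traceback, if any
    failed_tests: list[str] = field(default_factory=list)  # failing unit tests

def develop(
    generate: Callable[[], str],                 # LLM drafts code for the phase
    execute: Callable[[str], ExecResult],        # sandboxed run + unit tests
    debug: Callable[[str, str], str],            # LLM patches code given feedback
    max_attempts: int = 5,
) -> str:
    """Iterate until the code both runs cleanly and passes all unit tests."""
    code = generate()
    for _ in range(max_attempts):
        result = execute(code)
        if result.error:                         # execution failed: debug pass
            code = debug(code, result.error)
        elif result.failed_tests:                # logic inconsistent: debug pass
            code = debug(code, "\n".join(result.failed_tests))
        else:
            return code                          # correct and consistent
    raise RuntimeError("phase did not converge within max_attempts")
```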
In our evaluation across 8 Kaggle data science competitions, AutoKaggle achieved a validation submission rate of 0.85 and a comprehensive score of 0.82, demonstrating its effectiveness in handling complex data science tasks while maintaining transparency and interpretability throughout the process.
We provide two schema formats for each machine learning tool: JSON and Markdown. Here, we take the FillMissingValues tool as an example and present its schema in Markdown format.
- **Description:** Fill missing values in specified columns of a DataFrame. This tool can handle both numerical and categorical features by using different filling methods.
- **Applicable Situations:** Handling missing values in various types of features.
- **Required Parameters:** `data`, `columns`
- **Result:** Missing values in the specified column(s) of `data` are successfully filled.
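For intuition, here is a minimal pandas sketch of what a tool conforming to this schema might look like. The function name, signature, and the mean/mode imputation choices are our assumptions; the schema excerpt only fixes the required parameters `data` and `columns`:

```python
import pandas as pd

def fill_missing_values(data: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Fill missing values in the specified columns of `data`.

    Hypothetical implementation: numerical columns are imputed with the
    mean, categorical columns with the mode, reflecting the schema's claim
    of handling both feature types with different filling methods.
    """
    data = data.copy()
    for col in columns:
        if pd.api.types.is_numeric_dtype(data[col]):
            data[col] = data[col].fillna(data[col].mean())
        elif not data[col].mode().empty:
            data[col] = data[col].fillna(data[col].mode().iloc[0])
    return data

# Example: impute a numerical and a categorical column in Titanic-like data.
df = pd.DataFrame({"Age": [22.0, None, 38.0], "Embarked": ["S", None, "C"]})
df = fill_missing_values(df, columns=["Age", "Embarked"])
```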
We select eight Kaggle competitions that predominantly use tabular datasets, focusing on classification and regression tasks. These competitions fall into two categories: Classic Kaggle and Recent Kaggle. Classic Kaggle competitions began before October 2023 and attracted at least 500 participants, whereas Recent Kaggle competitions began in 2024 or later. Because our analysis relies on GPT-4o, whose training data extends to October 2023 and therefore likely covers most Classic Kaggle competitions, we include the Recent competitions specifically to evaluate AutoKaggle's generalization capabilities. Additionally, we classify the competitions into three difficulty levels: easy, medium, and hard. For each dataset, we access the corresponding competition's homepage on Kaggle, extract the content of the overview and data description sections, and compile this information into a file named overview.txt. This file, along with the original competition data files, forms the primary input for AutoKaggle.
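A hypothetical sketch of this preparation step; the helper and the directory layout below are illustrative, not the exact procedure used:

```python
from pathlib import Path

def build_overview(competition_dir: str, overview: str, data_description: str) -> Path:
    """Write overview.txt (competition overview + data description) next to
    the original data files; together these form AutoKaggle's primary input."""
    path = Path(competition_dir) / "overview.txt"
    path.write_text(overview.strip() + "\n\n" + data_description.strip(),
                    encoding="utf-8")
    return path

# Expected input layout for one competition (illustrative):
#   titanic/
#     overview.txt   <- compiled from the Kaggle overview + data sections
#     train.csv
#     test.csv
```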
| Category | No. | Task Name | Task | Level | Teams | Train Size | Test Size |
|---|---|---|---|---|---|---|---|
| Classic | 1 | Titanic | Classification | Medium | 13,994 | 891 | 418 |
| Classic | 2 | Spaceship Titanic | Classification | Easy | 1,720 | 8,693 | 4,277 |
| Classic | 3 | House Prices | Regression | Medium | 4,383 | 1,460 | 1,459 |
| Classic | 4 | Monsters | Classification | Easy | 763 | 371 | 529 |
| Recent | 5 | Academic Success | Classification | Medium | 2,684 | 76.5K | 51K |
| Recent | 6 | Bank Churn | Classification | Easy | 3,632 | 165K | 110K |
| Recent | 7 | Obesity Risk | Classification | Easy | 3,587 | 20.8K | 13.8K |
| Recent | 8 | Plate Defect | Classification | Medium | 2,199 | 19.2K | 12.8K |
| Metric | Setting | Task 1 | Task 2 | Task 3 | Task 4 | Task 5 | Task 6 | Task 7 | Task 8 | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Made Submission | AutoKaggle gpt-4o | 1 | 0.80 | 0.80 | 1 | 0.80 | 0.80 | 0.80 | 0.80 | 0.85 |
| Made Submission | AutoKaggle o1-mini | 1 | 0.60 | 0.60 | 1 | 0.60 | 0.80 | 0.60 | 0.60 | 0.73 |
| Made Submission | AIDE gpt-4o | 1 | 0.40 | 0.20 | 0.60 | 1 | 0.80 | 0.80 | 0 | 0.60 |
| Valid Submission | AutoKaggle gpt-4o | 1 | 0.80 | 0.80 | 1 | 0.80 | 0.60 | 0.80 | 0.80 | 0.83 |
| Valid Submission | AutoKaggle o1-mini | 1 | 0.60 | 0.60 | 1 | 0.60 | 0.60 | 0.60 | 0.60 | 0.70 |
| Valid Submission | AIDE gpt-4o | 1 | 0.40 | 0.20 | 0.40 | 1 | 0.80 | 0.80 | 0 | 0.58 |
| Comprehensive Score | AutoKaggle gpt-4o | 0.888 | 0.786 | 0.831 | 0.862 | 0.810 | 0.728 | 0.848 | 0.812 | 0.821 |
| Comprehensive Score | AutoKaggle o1-mini | 0.879 | 0.680 | 0.729 | 0.863 | 0.709 | 0.735 | 0.742 | 0.735 | 0.759 |
| Comprehensive Score | AIDE gpt-4o | 0.872 | 0.597 | 0.542 | 0.561 | 0.918 | 0.793 | 0.848 | 0 | 0.641 |

Tasks 1–4 are Classic competitions; Tasks 5–8 are Recent.
Figure: Average normalized performance score for different settings/tasks.
Made Submission and Valid Submission. We first evaluated the rate at which each configuration generated a submission.csv file at all (made submission) and one conforming to the required format (valid submission). The AutoKaggle framework with GPT-4o demonstrated superior performance, with an average valid submission rate of 83% across all 8 Kaggle tasks, surpassing the AIDE framework's 58% by 25 percentage points. These results underscore the robustness of our framework in executing complete data science workflows. While the AIDE framework successfully processed Tasks 1-7, which involve single-target classification or regression on tabular data, it failed to generate valid submissions for Task 8, a multi-label classification problem. This contrast demonstrates our framework's versatility in handling diverse tabular data tasks.
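The reported averages can be reproduced directly from the per-task rates in the table above (the 0.20 granularity suggests each rate covers five runs per task, though the excerpt does not state this):

```python
# Per-task valid-submission rates, copied from the Valid Submission rows above.
autokaggle_gpt4o = [1.00, 0.80, 0.80, 1.00, 0.80, 0.60, 0.80, 0.80]
aide_gpt4o       = [1.00, 0.40, 0.20, 0.40, 1.00, 0.80, 0.80, 0.00]

avg = lambda rates: sum(rates) / len(rates)
print(avg(autokaggle_gpt4o))  # 0.825 -> reported as 0.83
print(avg(aide_gpt4o))        # 0.575 -> reported as 0.58
```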
Another interesting observation is that within the AutoKaggle framework, the GPT-4o model achieved better results than the o1-mini model, despite the latter's purported superior reasoning capabilities. This performance difference emerged solely from varying the model used in the Planner component. We hypothesize that this counterintuitive result stems from o1-mini's tendency toward excessive planning complexity, which proves disadvantageous in our streamlined, phase-based workflow architecture. This same consideration influenced our decision to maintain GPT-4o as the Developer's base model, as our experiments indicated that an o1-mini-based Developer would significantly increase code verbosity, expanding 100-line solutions to approximately 500 lines through the introduction of superfluous components such as logging systems.
Comprehensive Score. We then compared overall performance across the 8 Kaggle tasks. AutoKaggle with GPT-4o achieved the highest comprehensive score on 5 of the 8 tasks and the best overall average (0.821). On the average normalized performance score metric, however, the figure above shows that AutoKaggle with o1-mini achieved the highest overall score. This indicates that although the o1-mini-based Planner generated overly complex plans that increased development difficulty (depressing its submission rates, and with them its comprehensive score), the runs that executed those plans to specification achieved superior predictive performance.
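Both claims about the comprehensive score can be checked against the table; a short sanity check that reproduces the averages and counts the tasks where the GPT-4o setting ties or leads:

```python
# Per-task comprehensive scores, copied from the table above.
scores = {
    "AutoKaggle gpt-4o":  [0.888, 0.786, 0.831, 0.862, 0.810, 0.728, 0.848, 0.812],
    "AutoKaggle o1-mini": [0.879, 0.680, 0.729, 0.863, 0.709, 0.735, 0.742, 0.735],
    "AIDE gpt-4o":        [0.872, 0.597, 0.542, 0.561, 0.918, 0.793, 0.848, 0.000],
}
for setting, vals in scores.items():
    print(setting, round(sum(vals) / len(vals), 3))  # 0.821, 0.759, 0.641

# Tasks where AutoKaggle gpt-4o matches or beats every other setting: 5 of 8.
best = sum(
    scores["AutoKaggle gpt-4o"][i] >= max(v[i] for v in scores.values())
    for i in range(8)
)
print(best)  # 5
```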
@misc{li2024autokagglemultiagentframeworkautonomous,
title={AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions},
author={Ziming Li and Qianbo Zang and David Ma and Jiawei Guo and Tianyu Zheng and Minghao Liu and Xinyao Niu and Yue Wang and Jack Yang and Jerry Liu and Wanjun Zhong and Wangchunshu Zhou and Wenhao Huang and Ge Zhang},
year={2024},
eprint={2410.20424},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2410.20424},
}