Data science tasks involving tabular data present complex challenges that require sophisticated problem-solving approaches. We propose AutoKaggle, a powerful and user-centric framework that assists data scientists in completing daily data science pipelines through a collaborative multi-agent system. AutoKaggle implements an iterative development process that combines code execution, debugging, and comprehensive unit testing to ensure code correctness and logical consistency. The framework offers highly customizable workflows, allowing users to intervene at each phase and thus integrating automated intelligence with human expertise. Our universal data science toolkit, comprising validated functions for data cleaning, feature engineering, and modeling, forms the foundation of this solution, enhancing productivity by streamlining common tasks. We selected eight Kaggle competitions to simulate real-world data processing workflows. Evaluation results show that AutoKaggle achieves a valid submission rate of 0.85 and a comprehensive score of 0.82 on typical data science pipelines, demonstrating its effectiveness and practicality in handling complex data science tasks.
Despite the rapid advancement of LLM-based agents in data science automation, existing solutions often struggle with complex, multi-step data science tasks and lack transparency in their decision-making processes. We propose AutoKaggle, a universal multi-agent framework that provides end-to-end processing solutions for tabular data through several key innovations: an iterative development process that combines code execution, debugging, and unit testing; a phase-based, customizable workflow that users can inspect and intervene in at every stage; and a universal data science toolkit of validated functions for data cleaning, feature engineering, and modeling.
Figure 2: Development based on iterative debugging and testing.
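To make the cycle in the figure concrete, here is a minimal, self-contained Python sketch of a develop-debug-test loop of this kind. The names (`ExecResult`, `develop`, and the injected callables) are our own illustration, not AutoKaggle's actual API:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ExecResult:
    error: str | None = None                     # runtime traceback, if any
    failed_tests: list[str] = field(default_factory=list)  # failing unit tests

def develop(
    generate: Callable[[], str],                 # LLM drafts code for the phase
    execute: Callable[[str], ExecResult],        # sandboxed run + unit tests
    debug: Callable[[str, str], str],            # LLM patches code given feedback
    max_attempts: int = 5,
) -> str:
    """Iterate until the code both runs cleanly and passes all unit tests."""
    code = generate()
    for _ in range(max_attempts):
        result = execute(code)
        if result.error:                         # execution failed: debug pass
            code = debug(code, result.error)
        elif result.failed_tests:                # logic inconsistent: debug pass
            code = debug(code, "\n".join(result.failed_tests))
        else:
            return code                          # correct and consistent
    raise RuntimeError("phase did not converge within max_attempts")
```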
In our evaluation across 8 Kaggle data science competitions, AutoKaggle achieved a validation submission rate of 0.85 and a comprehensive score of 0.82, demonstrating its effectiveness in handling complex data science tasks while maintaining transparency and interpretability throughout the process.
We provide two schema formats for each machine learning tool: JSON and Markdown. Here, we take the FillMissingValues tool as an example and present its schema in Markdown format.
- **Description:** Fill missing values in specified columns of a DataFrame. This tool can handle both numerical and categorical features by using different filling methods.
- **Applicable Situations:** Handling missing values in various types of features.
- **Required Parameters:** `data`, `columns`
- **Result:** Missing values in the specified column(s) of `data` are successfully filled.
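For intuition, here is a minimal pandas sketch of what a tool conforming to this schema might look like. The function name, signature, and the mean/mode imputation choices are our assumptions; the schema excerpt only fixes the required parameters `data` and `columns`:

```python
import pandas as pd

def fill_missing_values(data: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Fill missing values in the specified columns of `data`.

    Hypothetical implementation: numerical columns are imputed with the
    mean, categorical columns with the mode, reflecting the schema's claim
    of handling both feature types with different filling methods.
    """
    data = data.copy()
    for col in columns:
        if pd.api.types.is_numeric_dtype(data[col]):
            data[col] = data[col].fillna(data[col].mean())
        elif not data[col].mode().empty:
            data[col] = data[col].fillna(data[col].mode().iloc[0])
    return data

# Example: impute a numerical and a categorical column in Titanic-like data.
df = pd.DataFrame({"Age": [22.0, None, 38.0], "Embarked": ["S", None, "C"]})
df = fill_missing_values(df, columns=["Age", "Embarked"])
```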
We select eight Kaggle competitions that predominantly use tabular datasets, focusing on classification and regression tasks. These competitions fall into two categories: Classic Kaggle and Recent Kaggle. Classic Kaggle competitions began before October 2023 and attracted at least 500 participants, whereas Recent Kaggle competitions began in 2024 or later. Because our analysis relies on GPT-4o, whose training data extends to October 2023 and therefore likely covers most Classic Kaggle competitions, we include the Recent competitions specifically to evaluate AutoKaggle's generalization capabilities. Additionally, we classify the competitions into three difficulty levels: easy, medium, and hard. For each dataset, we access the corresponding competition's homepage on Kaggle, extract the content of the overview and data description sections, and compile this information into a file named overview.txt. This file, along with the original competition data files, forms the primary input for AutoKaggle.
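A hypothetical sketch of this preparation step; the helper and the directory layout below are illustrative, not the exact procedure used:

```python
from pathlib import Path

def build_overview(competition_dir: str, overview: str, data_description: str) -> Path:
    """Write overview.txt (competition overview + data description) next to
    the original data files; together these form AutoKaggle's primary input."""
    path = Path(competition_dir) / "overview.txt"
    path.write_text(overview.strip() + "\n\n" + data_description.strip(),
                    encoding="utf-8")
    return path

# Expected input layout for one competition (illustrative):
#   titanic/
#     overview.txt   <- compiled from the Kaggle overview + data sections
#     train.csv
#     test.csv
```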
| Category | No. | Task Name | Task | Level | Teams | Train Size | Test Size |
|---|---|---|---|---|---|---|---|
| Classic | 1 | Titanic | Classification | Medium | 13,994 | 891 | 418 |
| Classic | 2 | Spaceship Titanic | Classification | Easy | 1,720 | 8,693 | 4,277 |
| Classic | 3 | House Prices | Regression | Medium | 4,383 | 1,460 | 1,459 |
| Classic | 4 | Monsters | Classification | Easy | 763 | 371 | 529 |
| Recent | 5 | Academic Success | Classification | Medium | 2,684 | 76.5K | 51K |
| Recent | 6 | Bank Churn | Classification | Easy | 3,632 | 165K | 110K |
| Recent | 7 | Obesity Risk | Classification | Easy | 3,587 | 20.8K | 13.8K |
| Recent | 8 | Plate Defect | Classification | Medium | 2,199 | 19.2K | 12.8K |
| Metric | Setting | Task 1 | Task 2 | Task 3 | Task 4 | Task 5 | Task 6 | Task 7 | Task 8 | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Made Submission | AutoKaggle gpt-4o | 1 | 0.80 | 0.80 | 1 | 0.80 | 0.80 | 0.80 | 0.80 | 0.85 |
| Made Submission | AutoKaggle o1-mini | 1 | 0.60 | 0.60 | 1 | 0.60 | 0.80 | 0.60 | 0.60 | 0.73 |
| Made Submission | AIDE gpt-4o | 1 | 0.40 | 0.20 | 0.60 | 1 | 0.80 | 0.80 | 0 | 0.60 |
| Valid Submission | AutoKaggle gpt-4o | 1 | 0.80 | 0.80 | 1 | 0.80 | 0.60 | 0.80 | 0.80 | 0.83 |
| Valid Submission | AutoKaggle o1-mini | 1 | 0.60 | 0.60 | 1 | 0.60 | 0.60 | 0.60 | 0.60 | 0.70 |
| Valid Submission | AIDE gpt-4o | 1 | 0.40 | 0.20 | 0.40 | 1 | 0.80 | 0.80 | 0 | 0.58 |
| Comprehensive Score | AutoKaggle gpt-4o | 0.888 | 0.786 | 0.831 | 0.862 | 0.810 | 0.728 | 0.848 | 0.812 | 0.821 |
| Comprehensive Score | AutoKaggle o1-mini | 0.879 | 0.680 | 0.729 | 0.863 | 0.709 | 0.735 | 0.742 | 0.735 | 0.759 |
| Comprehensive Score | AIDE gpt-4o | 0.872 | 0.597 | 0.542 | 0.561 | 0.918 | 0.793 | 0.848 | 0 | 0.641 |

Tasks 1–4 are Classic competitions; Tasks 5–8 are Recent.
Figure: Average normalized performance score for different settings/tasks.
Made Submission and Valid Submission. We first evaluated the rate at which each configuration generated a submission.csv file at all (made submission) and one conforming to the required format (valid submission). The AutoKaggle framework with GPT-4o demonstrated superior performance, with an average valid submission rate of 83% across all 8 Kaggle tasks, surpassing the AIDE framework's 58% by 25 percentage points. These results underscore the robustness of our framework in executing complete data science workflows. While the AIDE framework successfully processed Tasks 1-7, which involve single-target classification or regression on tabular data, it failed to generate valid submissions for Task 8, a multi-label classification problem. This contrast demonstrates our framework's versatility in handling diverse tabular data tasks.
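The reported averages can be reproduced directly from the per-task rates in the table above (the 0.20 granularity suggests each rate covers five runs per task, though the excerpt does not state this):

```python
# Per-task valid-submission rates, copied from the Valid Submission rows above.
autokaggle_gpt4o = [1.00, 0.80, 0.80, 1.00, 0.80, 0.60, 0.80, 0.80]
aide_gpt4o       = [1.00, 0.40, 0.20, 0.40, 1.00, 0.80, 0.80, 0.00]

avg = lambda rates: sum(rates) / len(rates)
print(avg(autokaggle_gpt4o))  # 0.825 -> reported as 0.83
print(avg(aide_gpt4o))        # 0.575 -> reported as 0.58
```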
Another interesting observation is that within the AutoKaggle framework, the GPT-4o model achieved better results than the o1-mini model, despite the latter's purported superior reasoning capabilities. This performance difference emerged solely from varying the model used in the Planner component. We hypothesize that this counterintuitive result stems from o1-mini's tendency toward excessive planning complexity, which proves disadvantageous in our streamlined, phase-based workflow architecture. This same consideration influenced our decision to maintain GPT-4o as the Developer's base model, as our experiments indicated that an o1-mini-based Developer would significantly increase code verbosity, expanding 100-line solutions to approximately 500 lines through the introduction of superfluous components such as logging systems.
Comprehensive Score. We then compared overall performance across the 8 Kaggle tasks. AutoKaggle with GPT-4o achieved the highest comprehensive score on 5 of the 8 tasks and the best overall average (0.821). On the average normalized performance score metric, however, the figure above shows that AutoKaggle with o1-mini achieved the highest overall score. This indicates that although the o1-mini-based Planner generated overly complex plans that increased development difficulty (depressing its submission rates, and with them its comprehensive score), the runs that executed those plans to specification achieved superior predictive performance.
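Both claims about the comprehensive score can be checked against the table; a short sanity check that reproduces the averages and counts the tasks where the GPT-4o setting ties or leads:

```python
# Per-task comprehensive scores, copied from the table above.
scores = {
    "AutoKaggle gpt-4o":  [0.888, 0.786, 0.831, 0.862, 0.810, 0.728, 0.848, 0.812],
    "AutoKaggle o1-mini": [0.879, 0.680, 0.729, 0.863, 0.709, 0.735, 0.742, 0.735],
    "AIDE gpt-4o":        [0.872, 0.597, 0.542, 0.561, 0.918, 0.793, 0.848, 0.000],
}
for setting, vals in scores.items():
    print(setting, round(sum(vals) / len(vals), 3))  # 0.821, 0.759, 0.641

# Tasks where AutoKaggle gpt-4o matches or beats every other setting: 5 of 8.
best = sum(
    scores["AutoKaggle gpt-4o"][i] >= max(v[i] for v in scores.values())
    for i in range(8)
)
print(best)  # 5
```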
@misc{li2024autokagglemultiagentframeworkautonomous,
title={AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions},
author={Ziming Li and Qianbo Zang and David Ma and Jiawei Guo and Tianyu Zheng and Minghao Liu and Xinyao Niu and Yue Wang and Jack Yang and Jerry Liu and Wanjun Zhong and Wangchunshu Zhou and Wenhao Huang and Ge Zhang},
year={2024},
eprint={2410.20424},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2410.20424},
}