AutoKaggle

A Multi-Agent Framework for Autonomous Data Science Competitions

Ziming Li1, Qianbo Zang1,5, David Ma1, Jiawei Guo1, Tuney Zheng1,
Minghao Liu3, Xinyao Niu4, Yue Wang1, Jack Yang1, Jerry Liu1,
Wanjun Zhong2, Wangchunshu Zhou1, Wenhao Huang1,2†, Ge Zhang1,2

1M-A-P, 2ByteDance Inc., 32077AI, 4University of Melbourne,
5Interdisciplinary Centre for Security, Reliability and Trust (SnT), Université du Luxembourg

†Corresponding Authors


Abstract

Data science tasks involving tabular data present complex challenges that require sophisticated problem-solving approaches. We propose AutoKaggle, a powerful and user-centric framework that assists data scientists in completing daily data pipelines through a collaborative multi-agent system. AutoKaggle implements an iterative development process that combines code execution, debugging, and comprehensive unit testing to ensure code correctness and logic consistency. The framework offers highly customizable workflows, allowing users to intervene at each phase, thus integrating automated intelligence with human expertise. Our universal data science toolkit, comprising validated functions for data cleaning, feature engineering, and modeling, forms the foundation of this solution, enhancing productivity by streamlining common tasks. We selected 8 Kaggle competitions to simulate data processing workflows in real-world application scenarios. Evaluation results demonstrate that AutoKaggle achieves a validation submission rate of 0.85 and a comprehensive score of 0.82 in typical data science pipelines, confirming its effectiveness and practicality in handling complex data science tasks.


Figure 1: Overview of AutoKaggle. AutoKaggle integrates a phase-based workflow with specialized agents (Reader, Planner, Developer, Reviewer, and Summarizer), iterative debugging and unit testing, a comprehensive machine learning tools library, and detailed reporting.

Introduction

Despite the rapid advancement of LLM-based agents in data science automation, existing solutions often struggle with complex, multi-step data science tasks and lack transparency in their decision-making processes. We propose AutoKaggle, a universal multi-agent framework that provides end-to-end processing solutions for tabular data through several key innovations:

  • Phase-based Workflow and Multi-agent Collaboration: AutoKaggle divides the data science competition process into six key phases, executed by five specialized agents (Reader, Planner, Developer, Reviewer, and Summarizer) working collaboratively.
  • Iterative Debugging and Unit Testing: The Developer agent employs code execution, debugging, and unit testing to ensure both syntactic correctness and logical consistency.
  • Machine Learning Tools Library: A comprehensive library of expert-written code snippets and custom tools enhances code generation efficiency while reducing reliance on LLMs for domain-specific knowledge.
  • Comprehensive Reporting: Detailed reports are generated after each phase, making the data processing workflows transparent and increasing user trust.

Figure 2: Development based on iterative debugging and unit testing.
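
To make the workflow concrete, the following is a minimal Python sketch of how a phase-based loop and the Developer's debug-and-test cycle could be wired together. All names here (PHASES, run_phase, execute_and_unit_test, and the prompts) are illustrative assumptions for exposition, not AutoKaggle's actual code or phase names.

    # A minimal sketch of a phase-based multi-agent loop with iterative debugging
    # and unit testing. All names are illustrative assumptions, not AutoKaggle's code.
    from dataclasses import dataclass
    from typing import Callable, Dict, List, Tuple

    # One plausible six-phase breakdown; the exact phase names are assumed here.
    PHASES: List[str] = [
        "background_understanding", "preliminary_eda", "data_cleaning",
        "in_depth_eda", "feature_engineering", "model_building_and_prediction",
    ]

    @dataclass
    class PhaseResult:
        code: str
        report: str

    def execute_and_unit_test(code: str) -> Tuple[bool, str]:
        """Run generated code and its unit tests; return (passed, error message).
        A real system would sandbox this; plain exec() keeps the sketch short."""
        try:
            exec(compile(code, "<generated>", "exec"), {})
            return True, ""
        except Exception as exc:  # syntax errors, runtime errors, failed asserts
            return False, repr(exc)

    def run_phase(phase: str, context: Dict[str, str],
                  llm: Callable[[str], str], max_debug_rounds: int = 3) -> PhaseResult:
        plan = llm(f"Plan the '{phase}' phase given context: {context}")        # Planner
        code = llm(f"Write code implementing this plan:\n{plan}")               # Developer
        for _ in range(max_debug_rounds):                                       # iterative debugging
            passed, error = execute_and_unit_test(code)
            if passed:
                break
            code = llm(f"The code failed with:\n{error}\nFix this code:\n{code}")
        review = llm(f"Review the code and outputs of phase '{phase}':\n{code}")    # Reviewer
        report = llm(f"Write a user-facing report for phase '{phase}':\n{review}")  # Summarizer
        return PhaseResult(code=code, report=report)

    def run_competition(overview: str, llm: Callable[[str], str]) -> Dict[str, str]:
        # Reader: digest the competition overview before the first phase.
        context = {"overview": llm(f"Extract the key requirements from:\n{overview}")}
        for phase in PHASES:
            result = run_phase(phase, context, llm)
            context[phase] = result.report  # per-phase reports keep the workflow transparent
        return context

In AutoKaggle itself, the Developer additionally draws on the machine learning tools library when generating code, and the per-phase reports allow users to inspect results and intervene between phases.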

In our evaluation across 8 Kaggle data science competitions, AutoKaggle achieved a validation submission rate of 0.85 and a comprehensive score of 0.82, demonstrating its effectiveness in handling complex data science tasks while maintaining transparency and interpretability throughout the process.

Machine Learning Tool Schema Example

We provide two schema formats for each machine learning tool: JSON and Markdown. Here, we take the FillMissingValues tool as an example and show its schema in Markdown format.

Overview

Description: Fill missing values in specified columns of a DataFrame. This tool can handle both numerical and categorical features by using different filling methods.

Applicable Situations: Handle missing values in various types of features.

Core Parameters

  • data (pd.DataFrame): A pandas DataFrame object representing the dataset.
  • columns (string | array): The name(s) of the column(s) where missing values should be filled.
  • method (string): The method to use for filling missing values.
    Enum: auto | mean | median | mode | constant. Default: auto.

Additional Settings

  • fill_value (number | string | null): The value to use when method is constant. Default: None.

Requirements & Results

Required Parameters: data, columns

Result: Missing values in the specified column(s) of data are filled.

Important Notes

  • The auto method uses mean for numeric columns and mode for non-numeric columns.
  • Using mean or median on non-numeric columns will raise an error.
  • The mode method uses the most frequent value, which may not always be appropriate.
  • Filling missing values can introduce bias, especially if the data is not missing completely at random.
  • Consider the impact of filling missing values on your analysis and model performance.
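
To illustrate how such a tool could be realized, below is a minimal pandas sketch consistent with the schema above; the function name, signature, and error handling are assumptions rather than the actual AutoKaggle tool code.

    # A minimal pandas sketch consistent with the FillMissingValues schema above.
    # The function name, signature, and error handling are assumptions, not the
    # actual AutoKaggle tool code.
    from typing import List, Optional, Union
    import pandas as pd

    def fill_missing_values(
        data: pd.DataFrame,
        columns: Union[str, List[str]],
        method: str = "auto",
        fill_value: Optional[Union[int, float, str]] = None,
    ) -> pd.DataFrame:
        if isinstance(columns, str):
            columns = [columns]
        data = data.copy()
        for col in columns:
            series = data[col]
            numeric = pd.api.types.is_numeric_dtype(series)
            # 'auto' uses mean for numeric columns and mode for non-numeric ones.
            chosen = ("mean" if numeric else "mode") if method == "auto" else method
            if chosen in ("mean", "median") and not numeric:
                raise ValueError(f"Cannot apply '{chosen}' to non-numeric column '{col}'.")
            if chosen == "mean":
                value = series.mean()
            elif chosen == "median":
                value = series.median()
            elif chosen == "mode":
                value = series.mode().iloc[0]  # most frequent value
            elif chosen == "constant":
                value = fill_value
            else:
                raise ValueError(f"Unknown method: {method!r}")
            data[col] = series.fillna(value)
        return data

    # Hypothetical usage on the Titanic dataset: Age (numeric) is filled with its
    # mean, Embarked (categorical) with its mode.
    # df = fill_missing_values(df, columns=["Age", "Embarked"], method="auto")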

Experiment

Dataset

We select eight Kaggle competitions that predominantly use tabular datasets, focusing on classification and regression tasks. These competitions fall into two categories: Classic Kaggle and Recent Kaggle. Classic Kaggle competitions began before October 2023 and attracted at least 500 participants, whereas Recent Kaggle competitions began in 2024 or later. Because our analysis relies on GPT-4o, whose training data extends to October 2023 and therefore likely covers most of the Classic competitions, we also include competitions launched in 2024 or later to evaluate the generalization capabilities of AutoKaggle. Additionally, we classify these competitions into three difficulty levels: easy, medium, and hard. For each dataset, we access the corresponding competition's homepage on Kaggle, extract the content of the overview and data description sections, and compile this information into a file named overview.txt. This file, along with the original competition data files, forms the primary input for AutoKaggle.

| Category | No. | Task Name | Task | Level | Teams | Train | Test |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Classic | 1 | Titanic | Classification | Medium | 13994 | 891 | 418 |
| Classic | 2 | Spaceship Titanic | Classification | Easy | 1720 | 8693 | 4277 |
| Classic | 3 | House Prices | Regression | Medium | 4383 | 1460 | 1459 |
| Classic | 4 | Monsters | Classification | Easy | 763 | 371 | 529 |
| Recent | 5 | Academic Success | Regression | Medium | 2684 | 76.5K | 51K |
| Recent | 6 | Bank Churn | Regression | Easy | 3632 | 165K | 110K |
| Recent | 7 | Obesity Risk | Classification | Easy | 3587 | 20.8K | 13.8K |
| Recent | 8 | Plate Defect | Regression | Medium | 2199 | 19.2K | 12.8K |
Table 1: Selected Kaggle tasks. For each task, we show its category, number, task type, difficulty level, number of teams, and train/test set sizes.

Main Results

| Metric | Setting | Task 1 | Task 2 | Task 3 | Task 4 | Task 5 | Task 6 | Task 7 | Task 8 | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Made Submission | AutoKaggle (gpt-4o) | 1 | 0.80 | 0.80 | 1 | 0.80 | 0.80 | 0.80 | 0.80 | 0.85 |
| Made Submission | AutoKaggle (o1-mini) | 1 | 0.60 | 0.60 | 1 | 0.60 | 0.80 | 0.60 | 0.60 | 0.73 |
| Made Submission | AIDE (gpt-4o) | 1 | 0.40 | 0.20 | 0.60 | 1 | 0.80 | 0.80 | 0 | 0.60 |
| Valid Submission | AutoKaggle (gpt-4o) | 1 | 0.80 | 0.80 | 1 | 0.80 | 0.60 | 0.80 | 0.80 | 0.83 |
| Valid Submission | AutoKaggle (o1-mini) | 1 | 0.60 | 0.60 | 1 | 0.60 | 0.60 | 0.60 | 0.60 | 0.70 |
| Valid Submission | AIDE (gpt-4o) | 1 | 0.40 | 0.20 | 0.40 | 1 | 0.80 | 0.80 | 0 | 0.58 |
| Comprehensive Score | AutoKaggle (gpt-4o) | 0.888 | 0.786 | 0.831 | 0.862 | 0.810 | 0.728 | 0.848 | 0.812 | 0.821 |
| Comprehensive Score | AutoKaggle (o1-mini) | 0.879 | 0.680 | 0.729 | 0.863 | 0.709 | 0.735 | 0.742 | 0.735 | 0.759 |
| Comprehensive Score | AIDE (gpt-4o) | 0.872 | 0.597 | 0.542 | 0.561 | 0.918 | 0.793 | 0.848 | 0 | 0.641 |
Table 2: Made submission, valid submission, and comprehensive score on 8 Kaggle tasks (Tasks 1-4 are Classic, Tasks 5-8 are Recent). Each experiment is repeated for 5 trials. The best performances on individual tasks are underlined, and the best performances across all tasks are bolded.

Figure 3: Average normalized performance score (NPS) for different settings and tasks.

Made submission and Valid submission. We first evaluated the success rate of generating a valid submission.csv file across different experimental configurations. The AutoKaggle framework, implemented with GPT-4o, demonstrated superior performance with an average valid submission rate of 83% across all 8 Kaggle tasks, surpassing the AIDE framework by 28%. These results underscore the robustness of our framework in executing comprehensive data science workflows. While the AIDE framework successfully processed Tasks 1-7, which involve single-variable classification or regression on tabular data, it failed to generate valid submissions for Task 8, a multi-variable classification problem that AutoKaggle handled successfully. This differential performance demonstrates our framework's versatility in handling diverse tabular data tasks.
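
As a quick consistency check, each value in the Avg. column of Table 2 is the arithmetic mean of the eight per-task scores; for example, for AutoKaggle with gpt-4o:

    # Reproduce the Avg. column of Table 2 for AutoKaggle (gpt-4o) from the
    # per-task scores (Tasks 1-8).
    from statistics import mean

    made_submission  = [1, 0.80, 0.80, 1, 0.80, 0.80, 0.80, 0.80]
    valid_submission = [1, 0.80, 0.80, 1, 0.80, 0.60, 0.80, 0.80]
    comprehensive    = [0.888, 0.786, 0.831, 0.862, 0.810, 0.728, 0.848, 0.812]

    print(f"{mean(made_submission):.3f}")   # 0.850
    print(f"{mean(valid_submission):.3f}")  # 0.825 (reported as 0.83 in Table 2)
    print(f"{mean(comprehensive):.3f}")     # 0.821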

Another interesting observation is that within the AutoKaggle framework, the GPT-4o model achieved better results than the o1-mini model, despite the latter's purported superior reasoning capabilities. This performance difference emerged solely from varying the model used in the Planner component. We hypothesize that this counterintuitive result stems from o1-mini's tendency toward excessive planning complexity, which proves disadvantageous in our streamlined, phase-based workflow architecture. This same consideration influenced our decision to maintain GPT-4o as the Developer's base model, as our experiments indicated that an o1-mini-based Developer would significantly increase code verbosity, expanding 100-line solutions to approximately 500 lines through the introduction of superfluous components such as logging systems.

Comprehensive Score. Subsequently, we compared the overall performance of different settings across the 8 Kaggle tasks. AutoKaggle with GPT-4o achieved the highest comprehensive score on 5 tasks and demonstrated the best overall performance. However, as shown in the figure above, when the settings are compared on the average normalized performance score metric, AutoKaggle with o1-mini achieved the highest overall score. This indicates that although the o1-mini-based Planner generated overly complex plans that increased development difficulty, successfully executing these plans according to specifications led to superior performance outcomes.

BibTeX


        @misc{li2024autokagglemultiagentframeworkautonomous,
          title={AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions}, 
          author={Ziming Li and Qianbo Zang and David Ma and Jiawei Guo and Tianyu Zheng and Minghao Liu and Xinyao Niu and Yue Wang and Jack Yang and Jerry Liu and Wanjun Zhong and Wangchunshu Zhou and Wenhao Huang and Ge Zhang},
          year={2024},
          eprint={2410.20424},
          archivePrefix={arXiv},
          primaryClass={cs.AI},
          url={https://arxiv.org/abs/2410.20424}, 
        }