
MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering

Georgia Institute of Technology    Stanford University
*Indicates Equal Contribution

MLE-Dojo is a Gym-style framework for systematically training, evaluating, and improving large language model (LLM) agents in iterative machine learning engineering (MLE) workflows.

Abstract

We introduce MLE-Dojo, a Gym-style framework for systematically training, evaluating, and improving autonomous large language model (LLM) agents in iterative machine learning engineering (MLE) workflows. Unlike existing benchmarks that primarily rely on static datasets or single-attempt evaluations, MLE-Dojo provides an interactive environment enabling agents to iteratively experiment, debug, and refine solutions through structured feedback loops. Built upon 200+ real-world Kaggle challenges (e.g., tabular data analysis, computer vision, natural language processing, and time series forecasting), MLE-Dojo covers diverse, open-ended MLE tasks carefully curated to reflect realistic engineering scenarios such as data processing, architecture search, hyperparameter tuning, and code debugging. Its fully executable environment supports comprehensive agent training via both supervised fine-tuning and reinforcement learning, facilitating iterative experimentation, realistic data sampling, and real-time outcome verification. Extensive evaluations of eight frontier LLMs reveal that while current models achieve meaningful iterative improvements, they still exhibit significant limitations in autonomously generating long-horizon solutions and efficiently resolving complex errors. Furthermore, MLE-Dojo's flexible and extensible architecture seamlessly integrates diverse data sources, tools, and evaluation protocols, uniquely enabling model-based agent tuning and promoting interoperability, scalability, and reproducibility.

Introduction

MLE-Dojo serves as a systematic framework for training, evaluating, and improving MLE agents, with the key features described below.


Data

The MLE-Dojo benchmark comprises over 200 real-world machine learning tasks spanning tabular data, computer vision, NLP, and time series, sourced from Kaggle. Each task is standardized into a unified format—featuring structured descriptions, reorganized datasets, local evaluators, and human leaderboards—designed for seamless interaction with LLM agents. Tasks are selected for their diversity, practical relevance, and validation feasibility, forming a scalable and extensible dataset tailored for training and evaluating autonomous ML agents under realistic, iterative workflows. Users can effortlessly and flexibly incorporate new tasks, enabling seamless adaptation to diverse requirements and application scenarios.


Key Features

Modular and User-Friendly Interface: The environment is composed of modular components—Error, Interface, Feedback, and Metric—that are fully decoupled and extensible via a clean registration API. A single env.step call enables seamless agent-environment interaction, simplifying agent design and integration.
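A minimal sketch of this single-call interaction, assuming a hypothetical `make_env` constructor, task id, and action schema (the actual import path and payload format may differ):

```python
# Minimal interaction sketch; `make_env`, the task id, and the action
# schema are hypothetical, but the single `env.step` call mirrors the
# interface described above.
from mle_dojo import make_env  # hypothetical import path

env = make_env("titanic")      # hypothetical Kaggle task id
obs = env.reset()

obs, reward, done, info = env.step({
    "type": "execute_code",    # one of the five core actions
    "code": "print('hello, dojo')",
})
```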

Extensible Task Space: All tasks are isolated in reproducible Docker containers with configurable execution sandboxes. A unified data format standardizes integration, allowing users to add custom competitions with minimal effort, ensuring compatibility and secure agent testing.
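For illustration, a hedged sketch of what adding a custom competition could look like under the unified data format; `register_competition` and the directory layout below are assumptions, not confirmed API:

```python
# Hypothetical sketch of adding a custom competition; the directory names
# paraphrase the unified format described above (structured description,
# reorganized data, local evaluator, human leaderboard).
#
# my_competition/
#   description.md      # structured task description
#   data/               # reorganized train/test splits
#   evaluator.py        # local metric implementation
#   leaderboard.csv     # human leaderboard used for HumanRank scoring
from mle_dojo import register_competition  # hypothetical API

register_competition(name="my_competition", path="./my_competition")
```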

Observation Space: Each environment provides rich, structured observations, including competition context, evaluation metrics, code execution results, detailed error messages, and both agent- and environment-side interaction histories. This empowers agents with full situational awareness.
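An illustrative (not verbatim) observation payload, paraphrasing the fields listed above; the concrete schema may differ:

```python
# Illustrative shape of a single observation; field names are assumptions
# based on the categories described above.
obs = {
    "competition_context": "Predict survival on the Titanic ...",
    "evaluation_metric": "accuracy",
    "execution_output": "Validation accuracy: 0.78",
    "error_message": None,   # populated with detailed errors on failed runs
    "agent_history": [],     # agent-side interaction history
    "env_history": [],       # environment-side interaction history
}
```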

Expandable Action Space: MLE-Dojo supports five core actions—request_info, validate_code, execute_code, get_history, and reset—and allows users to register new actions through a customizable portal, enabling advanced experimentation and behavior design.
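A hedged sketch of registering a custom action through such a portal; `register_action`, its signature, and `env.execute` are hypothetical stand-ins:

```python
# Hypothetical action-registration sketch; the registration API and the
# sandbox call are assumptions, not confirmed MLE-Dojo interfaces.
from mle_dojo import register_action  # hypothetical import

def profile_code(env, code: str) -> dict:
    """Custom action: run a snippet in the sandbox and report its runtime."""
    result = env.execute(code)  # hypothetical sandbox execution call
    return {"runtime_s": result.get("runtime"), "stdout": result.get("stdout")}

register_action("profile_code", profile_code)
```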

Reward Space and Environmental Feedback: Instead of coarse-grained medals, MLE-Dojo uses the HumanRank Score, a normalized leaderboard-based reward that reflects how well agents perform compared to human participants. This enables unified, fine-grained evaluation across diverse competitions.
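One natural reading of this score is a leaderboard-position normalization; the sketch below implements `1 - rank/N`, though MLE-Dojo's exact tie-handling and normalization may differ:

```python
# HumanRank sketch: normalize the agent's position against the human
# leaderboard (one plausible reading; the paper's exact formula may differ).
def human_rank(agent_score: float, human_scores: list[float],
               higher_is_better: bool = True) -> float:
    """Return 1 - rank/N: near 1.0 at the top, near 0.0 at the bottom."""
    beats = sum(
        (agent_score >= h) if higher_is_better else (agent_score <= h)
        for h in human_scores
    )
    n = len(human_scores)
    rank = n - beats + 1          # 1-based position among N human entries
    return 1 - rank / n

print(human_rank(0.81, [0.95, 0.90, 0.80, 0.70]))  # 0.25: places 3rd of 4
```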

Experiments

We evaluate 8 mainstream LLMs as MLE agents on 50 MLE-Dojo evaluation tasks covering tabular data, NLP, CV, time series, and general MLE-Lite tasks. To ensure a comprehensive evaluation, we jointly consider Area Under the Performance Profile (AUP), HumanRank score (H-Rank, %), and Elo rating as metrics. We actively maintain a long-term, real-time leaderboard to foster community-driven innovation.

Main Results

Reasoning and coding models such as o3-mini, DeepSeek-r1, and Gemini-2.5-Pro consistently achieve high rankings across all metrics, demonstrating strong adaptability, robustness, and overall effectiveness as MLE Agents.

Difficulty

We define a task's difficulty level by the average performance of the evaluated models relative to the human leaderboard. As shown in the figure, CV tasks are the most challenging: none has an average HumanRank score above 60, and more than half fall below 30. For MLE-Lite tasks, average HumanRank scores mostly exceed 30. Difficulty in the remaining domains is more evenly distributed.


Develop with MLE-Dojo

MLE-Dojo provides flexible Gym-style APIs that allow users to build personalized environments, introduce new datasets, and develop or reuse different agent scaffolds. To facilitate model training via Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL), MLE-Dojo provides a detailed history management system and a well-defined reward feedback mechanism, including both final outcome rewards and intermediate step rewards.
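A hedged sketch of a rollout-collection loop built on these mechanisms; `make_env`, the task id, the action schema, and the `human_rank` info key are assumptions, while the split between step rewards and a final outcome reward follows the description above:

```python
# Hypothetical trajectory-collection loop for RL-style training.
from mle_dojo import make_env  # hypothetical import path

def act(obs):
    """Stand-in policy; replace with an LLM-backed agent."""
    return {"type": "execute_code", "code": "print('attempt')"}

env = make_env("titanic")      # hypothetical task id
obs = env.reset()
trajectory = []

for _ in range(10):            # interaction budget
    action = act(obs)
    obs, reward, done, info = env.step(action)   # intermediate step reward
    trajectory.append({"obs": obs, "action": action, "reward": reward})
    if done:
        break

final_reward = info.get("human_rank", 0.0)  # final outcome reward (assumed key)
```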

Interface and APIs

📘 Quick API Example

MLE-Dojo offers a modular, Gym-style API for building personalized ML environments, integrating datasets, and customizing agent scaffolds. The toolkit supports intuitive interaction with ML competitions through four key components: Error, Interface, Feedback, and Metric.

Together, these modules ensure flexible development and secure execution, offering a unified and extensible system for LLM-agent research.
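A hedged end-to-end sketch exercising the five core actions; the import path, constructor, task id, and action schema are assumptions based on the description above, not the confirmed API:

```python
# End-to-end sketch touching all five core actions described above.
from mle_dojo import make_env  # hypothetical import path

env = make_env("titanic")                          # hypothetical task id
obs = env.reset()                                  # core action: reset

obs, reward, done, info = env.step(
    {"type": "request_info"})                      # core action: request_info

code = "import pandas as pd\nprint('baseline')"
obs, reward, done, info = env.step(
    {"type": "validate_code", "code": code})       # core action: validate_code
obs, reward, done, info = env.step(
    {"type": "execute_code", "code": code})        # core action: execute_code
obs, reward, done, info = env.step(
    {"type": "get_history"})                       # core action: get_history

print(f"HumanRank reward: {reward:.3f}")
```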

Collect Trajectories for Training

MLE-Dojo tracks both agent behavior and environment responses with structured feedback, enabling training via Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL). The structured trajectory samples can be found at Agent History and Environment History.
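For illustration, a minimal sketch of serializing such interaction records to JSONL for SFT; the record schema here is an assumption, not MLE-Dojo's exact trajectory format:

```python
# Serialize collected interaction records to JSONL for SFT
# (field names are assumptions paraphrasing the histories above).
import json

records = [
    {"observation": "error: NameError ...", "action": "execute_code", "reward": 0.0},
    {"observation": "Validation RMSE: 0.42", "action": "execute_code", "reward": 0.63},
]

with open("trajectories.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```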

BibTeX

@misc{qiang2025mledojointeractiveenvironmentsempowering,
      title={MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering}, 
      author={Rushi Qiang and Yuchen Zhuang and Yinghao Li and Dingu Sagar V K and Rongzhi Zhang and Changhao Li and Ian Shu-Hei Wong and Sherry Yang and Percy Liang and Chao Zhang and Bo Dai},
      year={2025},
      eprint={2505.07782},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2505.07782}, 
}