We introduce MLE-Dojo, a Gym-style framework for systematically training, evaluating, and improving autonomous large language model (LLM) agents in iterative machine learning engineering (MLE) workflows. Unlike existing benchmarks that primarily rely on static datasets or single-attempt evaluations, MLE-Dojo provides an interactive environment enabling agents to iteratively experiment, debug, and refine solutions through structured feedback loops. Built upon 200+ real-world Kaggle challenges (e.g., tabular data analysis, computer vision, natural language processing, and time series forecasting), MLE-Dojo covers diverse, open-ended MLE tasks carefully curated to reflect realistic engineering scenarios such as data processing, architecture search, hyperparameter tuning, and code debugging. Its fully executable environment supports comprehensive agent training via both supervised fine-tuning and reinforcement learning, facilitating iterative experimentation, realistic data sampling, and real-time outcome verification. Extensive evaluations of eight frontier LLMs reveal that while current models achieve meaningful iterative improvements, they still exhibit significant limitations in autonomously generating long-horizon solutions and efficiently resolving complex errors. Furthermore, MLE-Dojo's flexible and extensible architecture seamlessly integrates diverse data sources, tools, and evaluation protocols, uniquely enabling model-based agent tuning and promoting interoperability, scalability, and reproducibility.
The MLE-Dojo benchmark comprises over 200 real-world machine learning tasks spanning tabular data, computer vision, NLP, and time series, sourced from Kaggle. Each task is standardized into a unified format—featuring structured descriptions, reorganized datasets, local evaluators, and human leaderboards—designed for seamless interaction with LLM agents. Tasks are selected for their diversity, practical relevance, and validation feasibility, forming a scalable and extensible dataset tailored for training and evaluating autonomous ML agents under realistic, iterative workflows. Users can easily incorporate new tasks, enabling adaptation to diverse requirements and application scenarios.
Modular and User-Friendly Interface: The environment is composed of modular components (`Error`, `Interface`, `Feedback`, and `Metric`) that are fully decoupled and extensible via a clean registration API. A single `env.step` call enables seamless agent-environment interaction, simplifying agent design and integration.
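To make the interaction pattern concrete, here is a minimal sketch. Only `env.step` and the action names are documented above; the factory function, its arguments, and the exact return tuple are assumptions made for illustration.

```python
# Minimal interaction sketch; `make_env` and its arguments are hypothetical.
from mle_dojo import make_env  # hypothetical import path

env = make_env(competition="some-kaggle-competition")  # hypothetical factory
obs = env.reset()  # initial competition context

# One agent-environment round trip: run code, receive structured feedback.
action = {"type": "execute_code", "code": "print('hello, dojo')"}
obs, reward, done, info = env.step(action)  # Gym-style return (assumed)
```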
Extensible Task Space: All tasks are isolated in reproducible Docker containers with configurable execution sandboxes. A unified data format standardizes integration, allowing users to add custom competitions with minimal effort, ensuring compatibility and secure agent testing.
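As a rough sketch of what registering a custom competition could look like under the unified data format (the registry module, function, and config fields below are hypothetical; consult the repository for the actual format):

```python
# Hypothetical registration of a custom competition; none of these names
# are confirmed project APIs, they only illustrate the unified-format idea.
from mle_dojo.registry import register_competition  # hypothetical

register_competition(
    name="my-tabular-task",
    data_dir="data/my_tabular_task",      # reorganized train/test splits
    evaluator="rmse",                     # local evaluation metric
    docker_image="mle-dojo/base:latest",  # sandboxed execution image
    description_file="description.md",    # structured task description
)
```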
Observation Space: Each environment provides rich, structured observations, including competition context, evaluation metrics, code execution results, detailed error messages, and both agent- and environment-side interaction histories. This empowers agents with full situational awareness.
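A sketch of what such an observation might look like as a dictionary, with field names that mirror the list above (the exact schema is an assumption):

```python
# Illustrative observation structure; keys are assumptions based on the
# components listed above, not the project's confirmed schema.
obs = {
    "competition_info": "Predict house prices from tabular features ...",
    "metric": "rmse",
    "execution_output": "Epoch 1/5 - loss: 0.42 ...",  # stdout of last run
    "error": None,                 # detailed traceback when execution fails
    "agent_history": [],           # prior agent-side actions
    "env_history": [],             # prior environment-side feedback
}
```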
Expandable Action Space: MLE-Dojo supports five core actions (`request_info`, `validate_code`, `execute_code`, `get_history`, and `reset`) and allows users to register new actions through a customizable portal, enabling advanced experimentation and behavior design.
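A hypothetical example of registering an extra action through such a portal (the decorator, its signature, and the data-access helper are illustrative assumptions, not the confirmed API):

```python
# Hypothetical custom-action registration; `register_action` and
# `env.load_train_dataframe` are illustrative assumptions.
from mle_dojo.actions import register_action  # hypothetical

@register_action("profile_data")
def profile_data(env, params):
    """Summarize the competition's training data for the agent."""
    df = env.load_train_dataframe()  # hypothetical data-access helper
    return df.describe().to_string()

# The agent could then invoke it like any core action:
# obs, reward, done, info = env.step({"type": "profile_data", "params": {}})
```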
Reward Space and Environmental Feedback: Instead of coarse-grained medals, MLE-Dojo uses the HumanRank Score, a normalized leaderboard-based reward that reflects how well agents perform compared to human participants. This enables unified, fine-grained evaluation across diverse competitions.
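One natural closed form consistent with this description maps a submission's leaderboard position into [0, 1]; treat it as a sketch, since the paper gives the official definition:

```python
def humanrank_score(position: int, num_participants: int) -> float:
    """Leaderboard-normalized reward: approaches 1.0 when the agent beats
    nearly all human entries and 0.0 when it ranks last. This closed form
    is an assumption consistent with the description above."""
    return 1.0 - position / num_participants

# e.g., placing 50th among 1000 human participants yields 0.95
```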
o3-mini, DeepSeek-r1, and Gemini-2.5-Pro consistently achieve high rankings across all metrics, demonstrating strong adaptability, robustness, and overall effectiveness as MLE Agents.
MLE-Dojo provides flexible Gym-style APIs that allow users to build personalized environments, introduce new datasets, and develop or utilize different agent scaffolds. To facilitate model training via Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL), MLE-Dojo provides a detailed history management system and a well-defined reward feedback mechanism, including both final outcome rewards and intermediate step rewards.
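A rollout in this setting might look roughly like the following (the agent object and the final-reward field are assumptions for illustration):

```python
# Hypothetical rollout collecting a trajectory for SFT or RL; `agent`
# and the outcome-reward key in `info` are illustrative assumptions.
trajectory = []
obs, done = env.reset(), False
while not done:
    action = agent.propose(obs)                      # agent-specific policy
    next_obs, reward, done, info = env.step(action)  # intermediate reward
    trajectory.append((obs, action, reward))
    obs = next_obs

final_reward = info.get("humanrank_score")  # assumed outcome-reward key
```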
MLE-Dojo offers a modular, Gym-style API for building personalized ML environments, integrating datasets, and customizing agent scaffolds. The toolkit supports intuitive interaction with ML competitions through the key components described above.
Together, these modules ensure flexible development and secure execution, offering a unified and extensible system for LLM-agent research.
MLE-Dojo tracks both agent behavior and environment responses with structured feedback, enabling training via Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL). The structured trajectory samples can be found at Agent History and Environment History.
@misc{qiang2025mledojointeractiveenvironmentsempowering,
title={MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering},
author={Rushi Qiang and Yuchen Zhuang and Yinghao Li and Dingu Sagar V K and Rongzhi Zhang and Changhao Li and Ian Shu-Hei Wong and Sherry Yang and Percy Liang and Chao Zhang and Bo Dai},
year={2025},
eprint={2505.07782},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2505.07782},
}