Papers
arxiv:2603.07300

AutoResearch-RL: Perpetual Self-Evaluating Reinforcement Learning Agents for Autonomous Neural Architecture Discovery

Published on Mar 7 · Submitted by Founder of Bibby AI on Mar 10

Abstract

An autonomous reinforcement learning framework conducts continuous neural architecture and hyperparameter research without human intervention, achieving performance comparable to hand-tuned baselines through automated experimentation.

AI-generated summary

We present AutoResearch-RL, a framework in which a reinforcement learning agent conducts open-ended neural architecture and hyperparameter research without human supervision, running perpetually until a termination oracle signals convergence or resource exhaustion. At each step the agent proposes a code modification to a target training script, executes it under a fixed wall-clock time budget, observes a scalar reward derived from validation bits-per-byte (val-bpb), and updates its policy via Proximal Policy Optimisation (PPO). The key design insight is the separation of three concerns: (i) a frozen environment (data pipeline, evaluation protocol, and constants) that guarantees fair cross-experiment comparison; (ii) a mutable target file (train.py) that represents the agent's editable state; and (iii) a meta-learner (the RL agent itself) that accumulates a growing trajectory of experiment outcomes and uses them to inform subsequent proposals. We formalise this as a Markov Decision Process, derive convergence guarantees under mild assumptions, and demonstrate empirically on a single-GPU nanochat pretraining benchmark that AutoResearch-RL discovers configurations that match or exceed hand-tuned baselines after approximately 300 overnight iterations, with no human in the loop.

Community



In Section 9 (Discussion & Limitations) the authors highlight the current constraint of isolating the mutable scope to a single file (train.py), and in Section 6.3 they use string edit distance to compute the novelty bonus. I wanted to share an architectural layer I've implemented that directly addresses both bottlenecks: replacing the raw-text state representation with a Semantic Code Graph (Abstract Syntax Tree) built with an open-source tool called GitNexus. By treating the codebase not as a concatenated 64,000-token string but as a relational topological graph, I've observed three major improvements in the autonomous loop:

i) Multi-file orchestration (breaking the sandbox): because the agent can query the AST for cross-file dependencies, it can safely orchestrate complex refactors without generating runtime crashes.

ii) True structural novelty rewards: string edit distance inherently rewards the agent for cosmetic changes (like renaming `loss` to `total_training_loss`). Evaluating diffs at the AST level lets the reward function isolate true algorithmic novelty, rewarding the policy only when it introduces a fundamentally new mathematical operator, loop structure, or library import (see the sketch below).

iii) Context compression and quality constraints: with AST truncation, the agent ingests only the function signatures relevant to the modification, dropping the context payload from 64k tokens to ~3k. I also add AST-derived cyclomatic complexity penalties directly to the reward function to prevent the agent from writing unmaintainable "spaghetti" code over hundreds of iterations.

Hope this helps someone!


