Papers
arxiv:2603.07300

AutoResearch-RL: Perpetual Self-Evaluating Reinforcement Learning Agents for Autonomous Neural Architecture Discovery

Published on Mar 7 · Submitted by Founder of Bibby AI on Mar 10

Abstract

An autonomous reinforcement learning framework conducts continuous neural architecture and hyperparameter research without human intervention, achieving performance comparable to hand-tuned baselines through automated experimentation.

AI-generated summary

We present AutoResearch-RL, a framework in which a reinforcement learning agent conducts open-ended neural architecture and hyperparameter research without human supervision, running perpetually until a termination oracle signals convergence or resource exhaustion. At each step the agent proposes a code modification to a target training script, executes it under a fixed wall-clock time budget, observes a scalar reward derived from validation bits-per-byte (val-bpb), and updates its policy via Proximal Policy Optimisation (PPO). The key design insight is the separation of three concerns: (i) a frozen environment (data pipeline, evaluation protocol, and constants) that guarantees fair cross-experiment comparison; (ii) a mutable target file (train.py) that represents the agent's editable state; and (iii) a meta-learner (the RL agent itself) that accumulates a growing trajectory of experiment outcomes and uses them to inform subsequent proposals. We formalise this as a Markov Decision Process, derive convergence guarantees under mild assumptions, and demonstrate empirically on a single-GPU nanochat pretraining benchmark that AutoResearch-RL discovers configurations that match or exceed hand-tuned baselines after approximately 300 overnight iterations, with no human in the loop.

Community



In Section 9 (Discussion & Limitations) the authors highlight the current constraint of isolating the mutable scope to a single file (train.py), and in Section 6.3 they use string edit distance to compute the novelty bonus. I wanted to share an architectural layer I've implemented that directly addresses both bottlenecks: replacing the raw-text state representation with a Semantic Code Graph (Abstract Syntax Tree) built with an open-source tool called GitNexus. By treating the codebase not as a concatenated 64,000-token string but as a relational topological graph, I've observed three major improvements in the autonomous loop:

i) Multi-file orchestration (breaking the sandbox): because the agent can query the AST for cross-file dependencies, it can safely orchestrate complex refactors without generating runtime crashes.

ii) True structural novelty rewards: string edit distance inherently rewards the agent for cosmetic changes (like renaming `loss` to `total_training_loss`). Evaluating diffs at the AST level lets the reward function isolate true algorithmic novelty, rewarding the policy only when it introduces a fundamentally new mathematical operator, loop structure, or library import (see the sketch below).

iii) Context compression and quality constraints: with AST truncation, the agent ingests only the function signatures relevant to the modification, dropping the context payload from 64k tokens to ~3k. I also add AST-derived cyclomatic complexity penalties directly to the reward function to prevent the agent from writing unmaintainable "spaghetti" code over hundreds of iterations.

Hope this helps someone!


