ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration
Abstract
ARIS is an open-source research harness that uses cross-model adversarial collaboration, coordinated across execution, orchestration, and assurance layers, to make long-horizon autonomous research more reliable.
This report describes ARIS (Auto-Research-in-sleep), an open-source harness for autonomous machine-learning research, covering its architecture, assurance mechanisms, and early deployment experience. The performance of agent systems built on LLMs depends not only on the model weights but also on the harness around them, which governs what information is stored, retrieved, and presented to the model. For long-horizon research workflows, the central failure mode is not a visible breakdown but a plausible unsupported success: a long-running agent can produce claims whose evidential support is incomplete, misreported, or silently inherited from the executor's framing. ARIS therefore coordinates machine-learning research workflows through cross-model adversarial collaboration as its default configuration: an executor model drives forward progress while a reviewer, recommended to come from a different model family, critiques intermediate artifacts and requests revisions. ARIS has three architectural layers. The execution layer provides more than 65 reusable Markdown-defined skills, model integrations via MCP, a persistent research wiki for iterative reuse of prior findings, and deterministic figure generation. The orchestration layer coordinates five end-to-end workflows with adjustable effort settings and configurable routing to reviewer models. The assurance layer includes a three-stage process for checking whether experimental claims are supported by evidence (integrity verification, result-to-claim mapping, and claim auditing that cross-checks manuscript statements against the claim ledger and raw evidence), as well as a five-pass scientific-editing pipeline, mathematical-proof checks, and visual inspection of the rendered PDF. A prototype self-improvement loop records research traces and proposes harness improvements that are adopted only after reviewer approval.
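To make the adversarial default concrete, the minimal sketch below shows the control flow the abstract describes; every name in it (`run_executor`, `run_reviewer`, `Review`, `adversarial_step`) is an illustrative assumption rather than ARIS's actual API.

```python
from dataclasses import dataclass

@dataclass
class Review:
    approved: bool
    revision_requests: list[str]

def run_executor(task: str, feedback: list[str]) -> str:
    # Placeholder: a real harness would call the executor model here.
    return f"draft for {task!r} addressing {len(feedback)} review points"

def run_reviewer(task: str, artifact: str) -> Review:
    # Placeholder: a real harness would call a reviewer from a different
    # model family and parse its critique into revision requests.
    return Review(approved=True, revision_requests=[])

def adversarial_step(task: str, max_rounds: int = 3) -> str:
    """The executor drives forward progress; the cross-family reviewer
    critiques each artifact and requests revisions until it approves."""
    feedback: list[str] = []
    artifact = ""
    for _ in range(max_rounds):
        artifact = run_executor(task, feedback)
        review = run_reviewer(task, artifact)
        if review.approved:
            return artifact
        feedback = review.revision_requests
    return artifact  # best effort once the round budget is exhausted
```

Routing the executor and reviewer to different model families is what lets the loop catch correlated errors that same-model self-refinement would miss.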
Community
ARIS coordinates ML research workflows through cross-model adversarial collaboration: an executor (Claude) and a reviewer from a different model family (GPT) catch correlated errors that same-model self-refinement misses. It ships 65+ Markdown-defined skills, 5 end-to-end workflows, and a 3-stage evidence-to-claim audit cascade. It is already open-source on GitHub with 8k+ stars, 5,000+ users, and 30+ community-contributed skills, and is executor-agnostic across Claude Code, Codex CLI, and Cursor.
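As a rough illustration of how the 3-stage evidence-to-claim audit cascade could fit together, here is a minimal Python sketch; the `ClaimRecord` schema and every function name below are assumptions made for illustration, not ARIS's actual interfaces.

```python
import os
from dataclasses import dataclass, field

@dataclass
class ClaimRecord:
    """One entry in a hypothetical claim ledger."""
    claim: str                                                # a statement the manuscript makes
    evidence_paths: list[str] = field(default_factory=list)   # raw artifacts backing it

def verify_integrity(record: ClaimRecord) -> bool:
    """Stage 1 (integrity verification): the cited evidence must at least exist."""
    return bool(record.evidence_paths) and all(
        os.path.exists(p) for p in record.evidence_paths
    )

def map_result_to_claim(record: ClaimRecord) -> bool:
    """Stage 2 (result-to-claim mapping). Placeholder: a real implementation
    would parse the evidence files and compare reported numbers to the claim."""
    return True

def audit_claims(manuscript: str, ledger: list[ClaimRecord]) -> list[str]:
    """Stage 3 (claim auditing): cross-check manuscript statements against
    the ledger and raw evidence, returning any unsupported claims."""
    return [
        r.claim
        for r in ledger
        if r.claim in manuscript
        and not (verify_integrity(r) and map_result_to_claim(r))
    ]
```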
Awesome work! Can’t wait for the PPT workflow to be released—it’ll be very helpful to follow the pipeline clearly.
We will release the ARIS-Slides workflow in two days, and we hope you like it!
Very nice work! This is a highly effective report, and ARIS already demonstrates a complete loop from research ideation and experimentation to review-driven revision and paper writing. I especially look forward to the integration of powerful drawing workflows such as GPT-image-2, which could further enhance conceptual figures, workflow diagrams, paper illustrations, and visual storytelling, enriching the overall paper-writing process.
In fact, the good news is that we already support GPT-image-2 and nano in our workflows and skills. Once you have Codex, we call it to generate paper figures, as shown in https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep/blob/main/skills/paper-illustration-image2/SKILL.md. But there are some important open questions for scientific figures, and this skill can be improved (an important future direction).
Very interesting work. In my own experience with auto-research systems, one of the biggest challenges is that my Agent Team often “convinces” me that a task has been completed or that an experiment has reached SOTA performance. However, when I manually inspect the results, the actual performance is often far from what the agents claimed. This makes evidence verification and adversarial review especially important for long-horizon research agents.
There are zero experiments or results in the paper; at least one is needed to convince researchers of its capabilities. I am doing an adversarial review (I am human).
Very nice question! We will release more papers generated by ARIS (with auto writing and an auto experiment queue) later. Coming soon~ This report mainly presents the motivation of ARIS. In the ARIS GitHub repository, some community-generated paper results are already provided.
Thanks again~