RQ-020 — Autoreason & AutoResearch


Abstract

Iterative self-refinement with LLMs fails for three structural reasons: prompt bias, scope creep, and lack of restraint. Karpathy’s AutoResearch sidesteps the problem by running against an objective metric. SHL0MS and NousResearch’s Autoreason extends the idea to subjective domains — positioning, copy, strategy — by replacing the metric with an adversarial tournament: an incumbent A, a rival B written without seeing A, and a synthesis AB, all judged by a blind panel via Borda count. If A survives two rounds, the loop halts. This paper synthesizes the method, its empirical capability-value curve, the role of a knowledge layer, and where Autoreason and AutoResearch plug into autonomous-business pipelines like show-1.
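The tournament loop in the abstract can be sketched as follows. This is a minimal illustration, not the actual Autoreason implementation: `generate_rival`, `synthesize`, and `judge` are hypothetical interfaces standing in for the Author B agent, the Synthesizer, and the blind Borda panel.

```python
def autoreason_loop(artifact, generate_rival, synthesize, judge, max_rounds=10):
    """Sketch of the adversarial refinement loop.

    Each round: a rival B is written blind (without seeing A), a synthesis
    AB merges both, and a judging function picks a winner among the three.
    Halting rule: if the incumbent A survives two consecutive rounds, stop.
    """
    a = artifact
    survivals = 0
    for _ in range(max_rounds):
        b = generate_rival()                          # written without seeing A
        ab = synthesize(a, b)                         # merged candidate
        winner = judge({"A": a, "B": b, "AB": ab})    # blind panel's pick
        if winner == a:
            survivals += 1
            if survivals == 2:                        # A survived twice: halt
                break
        else:
            a, survivals = winner, 0                  # challenger becomes incumbent
    return a
```

With stub agents, the loop halts as soon as the incumbent wins two rounds in a row; otherwise the winning challenger is promoted and the survival counter resets.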

Research Notes

  • Anatomy of the loop with fresh, isolated agents (Critic, Author B, Synthesizer, Judges) and a formal Borda-count argmax with ties broken in favor of A.
  • Ablations: removing either B or AB collapses the tournament within 2–3 passes; all three roles are necessary.
  • The central empirical finding: Autoreason’s value peaks at mid-tier models (Haiku 3.5 achieved a perfect 42/42 Borda score), vanishes at the extremes (Llama 8B produces no diversity; Sonnet 4.6 self-evaluates well enough on its own), and shifts upward as model capability grows.
  • Scope constraints flip the Sonnet 4.6 result: it places last on open-ended tasks and first on scoped ones.
  • A knowledge layer turns the loop from a generic-principles debater into one anchored in your historical numbers (winning vs. losing copy, CTR, audience data).
  • Mapping to our projects: DeFarm positioning, thepitch.report editorials, Filtro de Ouro (hybrid with AutoResearch), grant drafts, OSS PR descriptions. Genealogy does not fit — the answer is factual.
  • Relation to Self-Refine, AI Debate, Constitutional AI, LLM-as-Judge, DSPy, LLM Council, SlopCodeBench.
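The Borda-count argmax from the first note can be made concrete. This is a sketch under the standard Borda convention (a candidate ranked i-th among k receives k − 1 − i points); the function name and panel format are illustrative, not Autoreason's API. Ties are broken in favor of the incumbent A, so the loop only moves off A when a challenger strictly outscores it.

```python
from collections import defaultdict

def borda_winner(rankings, incumbent="A"):
    """Aggregate a blind panel's rankings with a Borda count.

    rankings: one ordering per judge, best candidate first,
    e.g. ["AB", "A", "B"]. Returns the candidate with the highest
    total score; exact score ties go to the incumbent.
    """
    scores = defaultdict(int)
    for order in rankings:
        k = len(order)
        for i, cand in enumerate(order):
            scores[cand] += k - 1 - i   # rank i among k earns k-1-i points
    # max by (score, is_incumbent): incumbent wins any exact tie.
    return max(scores, key=lambda c: (scores[c], c == incumbent))

# Three judges over candidates A, B, AB:
panel = [["AB", "A", "B"], ["A", "AB", "B"], ["A", "B", "AB"]]
```

On this panel A totals 5 points, AB 3, and B 1, so the incumbent survives the round.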

Full Paper