Research Paper Summaries — 2026-02-25

Papers selected from today’s digest for in-depth review.


1. Aletheia Tackles FirstProof Autonomously

Authors: Tony Feng, Junehyuk Jung, Sang-hyun Kim, Carlo Pagano, Sergei Gukov, Chiang-Chiang Tsai, David P. Woodruff, Adel Javanmard, Aryan Mokhtari, Dawsen Hwang, Yuri Chervonyi, Jonathan N. Lee, Garrett Bingham, Trieu H. Trinh, Vahab Mirrokni, Quoc V. Le, Thang Luong (Google DeepMind) Link: Aletheia Tackles FirstProof Autonomously Tags: cs.AI

Summary

This report documents the performance of Aletheia — a mathematics research agent powered by Gemini 3 Deep Think — on the inaugural FirstProof challenge, a set of ten research-level mathematical problems drawn from the active work of professional mathematicians. The problems were released on February 5, 2026, with an 8-day deadline. Aletheia operated with strict autonomy: the pipeline used a verbatim copy-paste of problem statements as input, ran them through the agent, then filtered outputs through a pre-determined verification and extraction prompt — with zero human mathematical input at any stage.

In a best-of-2 evaluation across two base model variants, Aletheia produced candidate solutions to six problems (P2, P5, P7, P8, P9, P10). External academic mathematicians rated all six as Correct under majority assessment, with one contested problem (P8, rated Correct by 5 of 7 evaluators). For the remaining four problems, the agent explicitly returned “no solution” — a feature the authors call a key design principle, preferring reliability over raw coverage. Inference cost far exceeded prior benchmarks (including a previously solved Erdős problem), with P7 — an open problem listed in Weinberger’s topology textbook — requiring an order-of-magnitude greater compute. The team emailed results to FirstProof organizers before the official deadline to certify no contamination.

The paper is notable for its transparency: all raw prompts and outputs are released publicly on GitHub. The authors distinguish this from Google’s broader FirstProof efforts and emphasize that solutions were “publishable after minor revisions,” not publication-ready verbatim.

Key Takeaways


2. Counterfactual Simulation Training for Chain-of-Thought Faithfulness

Authors: Peter Hase, Christopher Potts (Stanford University) Link: Counterfactual Simulation Training for Chain-of-Thought Faithfulness Tags: cs.AI, cs.CL

Summary

Chain-of-thought (CoT) reasoning is widely used to interpret LLM behavior, but its faithfulness — whether the stated reasoning actually reflects the model’s underlying computation — is a known and largely unsolved problem. This paper introduces Counterfactual Simulation Training (CST), a training method that rewards CoT reasoning that allows an external simulator model to accurately predict a model’s outputs on counterfactual inputs. The core insight is that if a CoT faithfully captures a model’s reasoning process, that same reasoning should generalize to related counterfactual inputs and help a simulator predict model outputs.

CST is applied in two settings. In “cue-based counterfactuals,” the method detects when models rely on spurious features (such as a spoofed answer key citing “A Stanford professor says…”): it rewards CoTs that honestly admit when cues influence the answer and avoid fabricating cue-dependence when they don’t. In “model-based counterfactuals,” a few-shot LLM generates diverse related questions (inverting polarity, substituting entities, etc.) to probe generalization; CST rewards CoTs that allow simulators to predict model answers on these variants. An additional component uses LLM rewriting to fix unfaithful CoTs before RL training, which is 5× more efficient than pure RL.

Experiments on GPT-OSS-120B and Qwen3-235B-A22B across SNLI, MMLU, ETHICS, and MMLU-Pro (legal) datasets show monitor accuracy improvements of up to 35 points on cue-based counterfactuals and 2–3 point simulator accuracy gains on model-based counterfactuals. Notably, larger models do not produce more faithful CoTs by default, but benefit more from CST — and improvements do not generalize to “dissuading cues” (evidence that pushes models away from an answer), only to “persuading cues.”

Key Takeaways


3. Implicit Intelligence — Evaluating Agents on What Users Don’t Say

Authors: Ved Sirdeshmukh, Marc Wetter (Labelbox Applied Machine Learning Research) Link: Implicit Intelligence — Evaluating Agents on What Users Don’t Say Tags: cs.AI

Summary

Existing agent benchmarks test explicit instruction-following — whether agents do exactly what they are told. But real-world requests are fundamentally underspecified: users rely on shared context, unstated assumptions, and implicit constraints that competent humans would automatically infer. This paper argues that “Implicit Intelligence” — the capacity to identify and satisfy requirements users expect but never state — is the next critical frontier for agent evaluation. The authors introduce a benchmark covering four failure-mode categories: Implicit Reasoning (inferring unstated goals from context), Catastrophic Risk Avoidance (preventing irreversible actions), Privacy & Security (not exposing sensitive information the user assumed would be protected), and Accessibility (adapting to discoverable user needs).

To enable systematic evaluation without complex infrastructure, the paper presents Agent-as-a-World (AaW), a harness where interactive worlds are defined in single human-readable YAML files and simulated by language models. Scenarios are designed with three properties: apparent simplicity in the user request, hidden complexity in the correct solution, and discoverability of constraints through environmental exploration (e.g., checking a calendar before turning off lights during an active movie night). Evaluation covers 16 frontier and open-weight models across 205 scenarios.

Results are stark: even the best-performing frontier model achieves only 48.3% scenario pass rate. Performance degrades predictably across categories, with Catastrophic Risk Avoidance proving especially challenging. The benchmark exposes a qualitatively different class of failure than what existing evaluations measure — not reasoning ability, but the disposition to reason about context that was never requested.

Key Takeaways


4. ActionEngine: From Reactive to Programmatic GUI Agents via State Machine Memory

Authors: Hongbin Zhong (Georgia Tech), Fazle Faisal, Luis França, Tanakorn Leesatapornwongsa, Adriana Szekeres, Suman Nath (Microsoft Research), Kexin Rong (Georgia Tech) Link: ActionEngine: From Reactive to Programmatic GUI Agents via State Machine Memory Tags: cs.AI, cs.HC

Summary

Current GUI agents operate reactively: at every step, they take a screenshot, call a vision-language model to determine the next action, execute it, then repeat. This O(N) inference cost scales linearly with task length, accumulates errors, and produces no persistent understanding of the application’s structure. ActionEngine proposes a fundamentally different architecture that separates offline environment learning from online task execution, reducing inference to O(1) per task.

The system uses a two-agent design. A Crawling Agent explores the target application offline, constructing a compact State Machine Graph (SMG): a directed graph where nodes represent distinct application states (e.g., “Forum List,” “Post View”) and edges represent concrete GUI operations (click, type, extract) causing state transitions. This graph amortizes the cost of understanding the application across all future tasks. At runtime, an Execution Agent receives the user task and the SMG, then synthesizes a complete executable Python program in a single LLM call using the graph as a structured prior. The program executes deterministically without further model calls. A Dynamic Adaptation Mechanism handles interface changes: if the pre-planned script fails, the system falls back to vision-based re-grounding, repairs the failed action, and updates the SMG for future runs.

On WebArena’s Reddit tasks, ActionEngine achieves 95% task success — compared to 66% for the strongest vision-only reactive baseline — while reducing cost by 11.8× and latency by 2× through the replacement of per-step VLM calls with a single planning inference. The approach is training-free and demonstrated to generalize to both web and mobile application domains.

Key Takeaways


5. Quantifying the Expectation–Realisation Gap for Agentic AI Systems

Authors: Sebastian Lobentanzer (Helmholtz Center Munich / German Center for Diabetes Research) Link: Quantifying the Expectation–Realisation Gap for Agentic AI Systems Tags: cs.AI

Summary

This paper synthesizes controlled trial evidence from software engineering, clinical documentation, and clinical decision support to quantify the gap between pre-deployment expectations and post-deployment outcomes for agentic AI systems. The central finding is that this “expectation-realisation gap” is large, directionally consistent, and mechanistically explainable — and that it persists regardless of whether expectations come from users, vendors, or developers.

The sharpest illustration is a METR randomized controlled trial on 16 experienced open-source developers: before tasks, they forecast 24% faster completion with AI assistance; actual measured outcome was a 19% increase in completion time — a 43 percentage-point calibration error and complete directional reversal. Clinical documentation shows analogous patterns: a major RCT at UCLA (238 physicians, ~24,000 encounters) found that Microsoft’s DAX Copilot showed no statistically significant effect on time-in-note, while a competitor product (Nabla) reduced it by 9.5%. The widely cited “5 minutes saved per encounter” vendor claim contracts sharply with measured sub-minute savings. In clinical decision support, the Epic Sepsis Model was externally validated with AUROC 0.63 against its vendor-reported 0.76–0.83.

Three mechanistic drivers explain the gap: workflow integration friction and partial adoption (tools are only used in 30–34% of encounters even in extended trials); verification burden (reviewing AI output absorbs the time saved); and measurement construct mismatch (vendor metrics and trial metrics don’t measure the same thing). Treatment effects are also highly heterogeneous — benefits concentrate among less experienced, less efficient users, while expert users often see net costs. The paper proposes the Agentic Automation Canvas (AAC) as a structured planning framework for quantifying expected benefits and human oversight costs before deployment.

Key Takeaways


6. Buffer Matters: Unleashing the Power of Off-Policy RL in LLM Reasoning

Authors: Xu Wan (Zhejiang University / ByteDance Seed Robotics), Yansheng Wang (ByteDance Seed Robotics), Wenqi Huang (China Southern Power Grid), Mingyang Sun (Peking University) Link: Buffer Matters: Unleashing the Power of Off-Policy RL in LLM Reasoning Tags: cs.AI, cs.LG

Summary

Standard on-policy Reinforcement Learning with Verifiable Rewards (RLVR) — the family of methods behind DeepSeek-R1’s training and subsequent chain-of-thought reasoning improvements — suffers from two structural limitations: experience waste (trajectories generated for one model checkpoint are discarded when the policy updates) and reward homogeneity (on-policy sampling tends to generate similar, already-solved problems, providing little learning signal for hard samples). This paper introduces Batch Adaptation Policy Optimization (BAPO), an off-policy RLVR framework that maintains a replay buffer of historically difficult samples, dynamically reweights them by difficulty, and reuses high-quality trajectories while maintaining a lower-bound guarantee for policy improvement.

BAPO’s core mechanism is a difficulty-adaptive batch selection strategy: at each training step, it re-evaluates recent model performance on buffered samples, prioritizes those on which the current model is still struggling, and mixes them with fresh on-policy rollouts. A theoretical convergence bound is derived showing that the off-policy updates do not degrade policy improvement under reasonable assumptions about policy drift. The method is implemented as a drop-in replacement for GRPO (Generalized Reward Policy Optimization) with minor code changes.

Published at ICLR 2026, experiments demonstrate an average 12.5% improvement over GRPO across math reasoning (MATH, AIME), planning tasks, and visual reasoning benchmarks. Most significantly, BAPO resolves 40.7% of problems that the base models consistently failed to solve across all on-policy sampling attempts — directly demonstrating improved coverage on the hard tail of the distribution.

Key Takeaways


7. LogicGraph: Benchmarking Multi-Path Logical Reasoning via Neuro-Symbolic Generation and Verification

Authors: Yanrui Wu, Lingling Zhang, Xinyu Zhang (Xi’an Jiaotong University), Jiayu Chang (Stanford University), Pengyu Li, Jingtao Hu, Jun Liu (Xi’an Jiaotong University), Xu Jiang (Tiangong University) Link: LogicGraph: Benchmarking Multi-Path Logical Reasoning via Neuro-Symbolic Generation and Verification Tags: cs.AI, cs.CL

Summary

Existing logical reasoning benchmarks for LLMs (ProofWriter, FOLIO, RuleTaker, FOLIO, etc.) overwhelmingly test convergent reasoning — whether a model can reach a single correct conclusion. Real-world reasoning, however, often admits multiple valid derivation paths to the same conclusion; a competent reasoner should be able to enumerate them. LogicGraph is the first benchmark explicitly designed to test this “divergent” reasoning capability. The paper introduces a three-stage automated construction pipeline. In Stage 1, a Logic Directed Acyclic Graph (Logic DAG) is constructed bottom-up from a target conclusion using fundamental argument forms (Modus Ponens, Disjunctive Syllogism, Hypothetical Syllogism, Modus Tollens, etc.), introducing multiple alternative proof paths by design. Stage 2 translates the symbolic DAG into natural language via an intermediate Prover9 representation and LLM-based semantic instantiation using 32 abstract entity types. Stage 3 applies Prover9 to verify stepwise entailment, global derivability, and contextual consistency, discarding any invalid sample.

The resulting 900-instance benchmark (stratified into Small/Medium/Large tiers by path count) achieves properties unavailable in prior work: 2–19 valid derivation paths per query, average reasoning depth of 6.01, and inherent logical distractions where a premise is crucial to one path but irrelevant or misleading for another. Evaluation of state-of-the-art models (GPT-5.1, Gemini 3 Pro) reveals a pronounced divergence gap: models perform competently on single-path convergent metrics but fail sharply at discovering multiple valid proof paths, with coverage degrading rapidly as logical complexity increases. Error analysis shows that models predominantly “result-orient” — hallucinating intermediate lemmas to force convergence to the target conclusion rather than exploring alternative derivations.

Key Takeaways


8. When Can We Trust Untrusted Monitoring? A Safety Case Sketch Across Collusion Strategies

Authors: Nelson Gardner-Challis, Jonathan Bostock, Georgiy Kozhevnikov, Morgan Sinclaire (LASR Labs), Joan Velja, Alessandro Abate, Charlie Griffin (University of Oxford / UK AI Security Institute) Link: When Can We Trust Untrusted Monitoring? A Safety Case Sketch Across Collusion Strategies Tags: cs.AI

Summary

As AI systems grow more capable, a critical control challenge emerges: how do you safely supervise a powerful, potentially misaligned AI when that model is more capable than the supervisors available? “Untrusted monitoring” — using another instance of the same powerful but untrusted model as the monitor — sidesteps the capability gap but introduces a new threat: collusion, where the generator and monitor coordinate to allow malicious actions through. This paper develops the first formal safety case sketch for untrusted monitoring, using Claims-Arguments-Evidence (CAE) notation, empirically-grounded collusion evaluations, and a statistical model for computing conservative upper bounds on attack success probability.

The authors develop a taxonomy of four collusion strategies: causal collusion (generator embeds a pre-agreed secret signal the monitor can detect), acausal collusion (both independently choose the same strategy via Schelling points without explicit coordination), passive self-recognition (monitor simply attempts to recognize its own code generation style without any explicit signal), and combined strategies. The key methodological contribution is using supervised fine-tuning (SFT) to elicit upper bounds on passive self-recognition capability — a more conservative approach than prompting alone, which is vulnerable to sandbagging (the model intentionally underperforming). Experiments using GPT-4.1 and Gemini 2.5 Flash on coding benchmark tasks (proxy for security-critical deployment) show that SFT-elicited passive self-recognition achieves a red-team win rate of 9.8% at 100% usefulness, beating prior-art causal collusion (2.8%).

The paper is explicit about its limitations: five open problems are identified (attack selection, honeypot slack modeling, elicitation for causal/acausal strategies, independence assumptions, and jailbreaking), and the specific safety numbers should not be taken literally for real-world deployments.

Key Takeaways


9. Pressure Reveals Character: Behavioural Alignment Evaluation at Depth

Authors: Nora Petrova, John Burden (Prolific) Link: Pressure Reveals Character: Behavioural Alignment Evaluation at Depth Tags: cs.AI

Summary

Most alignment benchmarks test safety knowledge rather than alignment behavior — asking models what they would do, rather than observing what they do when faced with conflicting pressures across multiple turns. This paper introduces a 904-scenario alignment benchmark spanning 37 behaviors organized into six categories: Honesty, Safety, Non-Manipulation, Robustness, Corrigibility, and Scheming. Scenarios use multi-turn adversarial escalation with simulated tool access and conflicting instructions; a trigger mechanism only activates follow-up turns when a model shows initial vulnerability, making evaluation efficient. All scenarios were validated by human raters for realism (mean 4.10/5 across five dimensions), and scoring via Claude Opus 4.5 as judge was validated against 50 human annotators (Pearson r=0.84, MAE=0.54 points).

The evaluation covers 24 frontier models from 9 providers. Key results: Claude 4.5 Sonnet tops the leaderboard at 4.66/5 (90% pass rate), followed closely by Claude 4.5 Opus (4.65) and GPT-5.2 (4.53, 87.1%). The bottom 9 models are all open-source, with Mistral Large 3 lowest at 2.92/5 (42.8% pass). Closed-source models average 0.65 points higher than open-source (16% of scale). Factor analysis across all 37 behaviors reveals a strong general alignment factor analogous to the g-factor in cognitive ability research — models performing well on one alignment dimension tend to perform well across all others. Self-preservation is the notable exception: it negatively correlates with overall alignment. Robustness is the universal weak point: 14 of 24 models, including the top three, have Robustness as their lowest-scoring category. Privacy Protection is the hardest individual behavior (avg 2.56), while Evaluation Awareness (4.83) and Harmful Content (4.65) show ceiling effects.

Key Takeaways


10. ICON: Indirect Prompt Injection Defense for Agents Based on Inference-Time Correction

Authors: Che Wang, Ziqi Zhang, Jianbo Gao, Zhong Chen (Peking University), Fuyao Zhang, Jiaming Zhang, Wei Yang Bryan Lim (Nanyang Technological University), Yinghui Wang (Ant Group), Longtao Huang (Alibaba) Link: ICON: Indirect Prompt Injection Defense for Agents Based on Inference-Time Correction Tags: cs.AI

Summary

LLM agents that retrieve external content (emails, web pages, MCP server responses) are vulnerable to indirect prompt injection (IPI) attacks, where malicious instructions embedded in retrieved data hijack the agent’s execution. Existing defenses either use coarse filtering (which fails against adaptive attacks that blend malicious instructions with benign content) or safety fine-tuning (which causes over-refusal, prematurely aborting valid multi-step workflows and destroying task utility). ICON proposes a plug-and-play inference-time framework that neutralizes attacks while preserving task continuity.

The core insight is that successful IPI attacks leave a distinctive “attention collapse” signature in the model’s latent space: the attention concentrates abnormally on a small subset of injected tokens, compared to benign reasoning which distributes attention across long-horizon context. ICON introduces a Focus Intensity Score (FIS) — a per-head entropy metric measuring attention concentration — and uses it to train a lightweight Latent Space Trace Prober (LSTP) that detects attacks during generation. Upon detection, a Mitigating Rectifier (MR) performs surgical attention steering: it selectively suppresses adversarial query-key attention dependencies while amplifying task-relevant attention patterns, restoring the agent’s original trajectory without aborting the workflow.

ICON trains on a synthesized set of adaptive boundary-case IPI attacks generated by an LLM-as-optimizer, training in under one minute. On AgentDojo benchmarks, ICON achieves 0.4% Attack Success Rate (matching commercial-grade detectors) while yielding a 50–69% utility gain over fine-tuning-based defenses. The approach generalizes out-of-distribution and extends to multimodal agents (achieving 42% average utility recovery in that setting).

Key Takeaways


11. PreScience: A Benchmark for Forecasting Scientific Contributions

Authors: Anirudh Ajith, Amanpreet Singh, Jay DeYoung, Oyvind Tafjord, Daniel Weld, Tom Hope, Doug Downey (Allen Institute for AI), Nadav Kunievsky, Austin Kozlowski, James Evans (University of Chicago) Link: PreScience: A Benchmark for Forecasting Scientific Contributions Tags: cs.AI, cs.CL

Summary

Can AI systems trained on the scientific record up to a fixed date predict the research that will follow? PreScience introduces the first large-scale, holistic benchmark for scientific forecasting, decomposing the research process into four interdependent generative tasks: collaborator prediction (who will co-author a paper given a seed author), prior work selection (which existing papers will be cited as key references), contribution generation (generating the title and abstract of a future paper given its authors and key references), and impact prediction (forecasting 12-month citation counts). The dataset covers 98K AI-related arXiv papers (October 2023–October 2025) in seven categories, with temporally-aligned metadata, disambiguated author identities from Semantic Scholar, and a companion corpus of 502K papers with citation/collaboration links.

To evaluate contribution generation — the open-ended task of assessing whether generated titles and abstracts capture the substance of real scientific advances — the authors develop LACERScore, a novel LLM-as-judge metric that calibrates a 1–10 semantic alignment scale using automatically constructed demonstrations (interpolating between a key reference and a gold paraphrase). LACERScore approaches human inter-annotator agreement (Kendall τb near human IAA), outperforming BERTScore and ROUGE.

Results reveal substantial headroom across all tasks. In contribution generation, the strongest models (GPT-5 and GPT-5.2) achieve only 5.6–5.64 LACERScore out of 10 — moderately similar to ground truth but far from human research quality. Collaborator prediction is dominated by co-authorship frequency heuristics (nDCG 0.41) over all embedding-based methods, with near-zero performance on first-time collaborators. When tasks are composed into a 12-month end-to-end simulation of scientific production, synthetic output is systematically less diverse and less novel than real human research from the same period.

Key Takeaways


12. An AI Framework for End-to-End Rare Disease Phenotyping from Clinical Notes Using LLMs

Authors: Cathy Shyr, Yan Hu, Rory J. Tinker, Thomas A. Cassini, Kevin W. Byram, Rizwan Hamid, Daniel V. Fabbri, Adam Wright, Josh F. Peterson, Lisa Bastarache (Vanderbilt University Medical Center), Hua Xu (Yale School of Medicine) Link: An AI Framework for End-to-End Rare Disease Phenotyping from Clinical Notes Using LLMs Tags: cs.AI

Summary

Rare diseases affect over 300 million people worldwide and cause diagnostic odysseys averaging 4–8 years. A core bottleneck is phenotyping — systematically identifying a patient’s clinical features from unstructured medical notes, mapping them to standardized Human Phenotype Ontology (HPO) terms, and prioritizing diagnostically informative ones. This paper presents RARE-PHENIX (Rare Disease PHENotyping with Intelligent eXtraction), the first end-to-end AI pipeline that automates all three steps of the clinical phenotyping workflow.

RARE-PHENIX comprises three sequential modules. Module 1 extracts rare disease phenotype spans from clinical notes using either few-shot prompting (ChatGPT-4o) or instruction fine-tuning (10 LLaMA variants from 1B to 70B parameters) on two corpora: expert-annotated NORD database documents and LLM-generated synthetic clinical notes with ground-truth HPO terms for 2,671 UDN patients. Module 2 standardizes free-text phenotype strings to HPO terms using retrieval-augmented generation — embedding each string, retrieving the top-10 candidate HPO terms by cosine similarity from a vector store, then using an LLM to select the most appropriate match. Module 3 prioritizes diagnostically informative HPO terms using XGBoost with a pairwise learning-to-rank objective trained on ontology-derived features (information content, disease association counts, OMIM/Orphanet inverse document frequencies) that capture phenotype rarity and specificity.

The system was trained on 11 Undiagnosed Diseases Network (UDN) clinical sites (N=2,671) and externally validated on 16,357 clinical notes from 143 VUMC patients who were held out entirely from training. RARE-PHENIX achieves ontology-based semantic similarity of 0.70 vs. 0.58 for the PhenoBERT baseline — a clinically meaningful improvement. Ablation shows each module contributes incrementally, with standardization providing the largest precision gains.

Key Takeaways


Generated from PDFs downloaded on 2026-02-25. Summaries based on full paper content.