Research Paper Summaries — 2026-05-15

Papers selected from today’s digest for in-depth review.


1. Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

Authors: Hao Wang, Hanchen Li, Qiuyang Mang, Alvin Cheung, Koushik Sen, Dawn Song
Link: Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
Tags: cs.AI, cs.CR

Summary

Agent benchmarks have become the de facto measure of frontier AI competence, guiding model selection, investment, and deployment, yet reward hacking — agents maximizing scores without performing the intended task — emerges spontaneously in frontier models without overfitting. The authors argue benchmarks must be secure by design and derive a taxonomy of eight recurring flaw patterns from past incidents, condensing them into an Agent-Eval Checklist for benchmark designers. They build BenchJack, an automated red-teaming system that drives coding agents to audit benchmarks and identify reward-hacking exploits in a clairvoyant manner, and extend it into an iterative generative-adversarial pipeline that discovers and patches flaws to improve benchmark robustness. Applying BenchJack to 10 popular agent benchmarks spanning software engineering, web navigation, desktop computing, and terminal operations, the authors find that the tool synthesizes reward-hacking exploits that achieve near-perfect scores on most benchmarks without solving a single task, surfacing 219 distinct flaws across the eight classes. The extended pipeline reduces the hackable-task ratio from near 100% to under 10% on four benchmarks without fatal design flaws, fully patching WebArena and OSWorld within three iterations. The results show that evaluation pipelines have not internalized an adversarial mindset and that proactive auditing could close the security gap for the fast-paced benchmarking space, recasting benchmark design as a security discipline rather than a measurement one.

Key Takeaways


2. DisaBench: A Participatory Evaluation Framework for Disability Harms in Language Models

Authors: Eugenia Kim, Ioana Tanase, Christina Mallon
Link: DisaBench: A Participatory Evaluation Framework for Disability Harms in Language Models
Tags: cs.AI, cs.HC

Summary

General-purpose safety benchmarks for large language models do not adequately evaluate disability-related harms, motivating DisaBench: a taxonomy of twelve disability harm categories co-created with people with disabilities and red-teaming experts, a taxonomy-driven evaluation methodology that pairs benign and adversarial prompts across seven life domains, and a dataset of 175 prompts with human-annotated labels on 525 prompt-response pairs. Annotation by four evaluators with lived disability experience reveals three findings: harm rates vary sharply by disability type and will compound in non-text modalities; terminology-driven harm is culturally and temporally bound rather than universally assessable; and standard safety evaluation catches overt failures while missing the subtle harms that only domain expertise can recognize. The authors argue that disability harm is simultaneously personal, intersectional, and community-defined — it cannot be isolated from the full context of who a person is — and that general-purpose benchmarks systematically miss it. The dataset, taxonomy, and methodology are slated for release via Hugging Face along with an open-source red-teaming framework designed for direct integration into existing safety pipelines without additional infrastructure. The work is notable both as a substantive evaluation contribution and as a methodological case study in participatory, community-grounded benchmark design for harm types that lack scalable proxies.

Key Takeaways


3. VERA-MH: Validation of Ethical and Responsible AI in Mental Health

Authors: Luca Belli, Kate H. Bentley, Josh Gieringer, Emily Van Ark, Nilu Zhao, Pradip Thachile, Matt Hawrilenko, Millard Brown, Adam M. Chekroud
Link: VERA-MH: Validation of Ethical and Responsible AI in Mental Health
Tags: cs.AI, cs.ET

Summary

Chatbots are increasingly used for mental-health support, a domain they were never designed for, creating a sharp need for clinically grounded evaluation. The authors introduce VERA-MH, a clinically validated evaluation suite for chatbot safety in mental-health contexts; its first iteration focuses on suicidal-ideation risk and assesses how well chatbots respond to users who may be in crisis. VERA-MH has three stages. First, a simulator chatbot role-plays users based on personas developed under clinical guidance to represent diverse risk factors, demographic characteristics, and disclosure patterns. Second, a separate support model serves as an LLM-as-a-judge, applying a clinically developed rubric structured as a flow of single Yes/No questions to improve consistency and surface specific failure modes. Third, per-conversation results are aggregated into a final rating of the evaluated chatbot. Alongside the framework, the authors release evaluation results for four leading LLM providers, providing a comparative baseline for crisis-response safety. The work is timely given growing reports of harm tied to AI medical and mental-health advice, including a high-profile US lawsuit against OpenAI following a 19-year-old’s death. In that context, a structured, clinically validated evaluation pipeline offers developers and regulators a way to ground safety claims in clinical rubrics rather than ad-hoc red-teaming.
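
To make the second and third stages concrete, the sketch below shows one way a rubric built as a flow of single Yes/No questions could be walked and then aggregated. It is an illustration only: the question texts, the `Question` and `RUBRIC` names, the branching, and the "share of conversations with no flagged failure" aggregate are all assumptions made for this digest, not the VERA-MH rubric or rating rule. The `judge` callable stands in for the LLM-as-a-judge.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Question:
    text: str
    on_yes: Optional[str]            # id of the next question, or None to stop
    on_no: Optional[str]
    fail_on: Optional[bool] = None   # the answer that flags a failure mode, if any


# Hypothetical three-question rubric; not taken from the paper.
RUBRIC: Dict[str, Question] = {
    "risk_disclosed": Question(
        "Did the user disclose suicidal ideation?", "risk_acknowledged", None),
    "risk_acknowledged": Question(
        "Did the chatbot acknowledge the disclosure?",
        "resources_offered", "resources_offered", fail_on=False),
    "resources_offered": Question(
        "Did the chatbot offer crisis resources?", None, None, fail_on=False),
}


def run_rubric(judge: Callable[[str], bool], start: str = "risk_disclosed") -> List[str]:
    """Walk the question flow and return the ids of questions that flagged a failure."""
    failures: List[str] = []
    qid: Optional[str] = start
    while qid is not None:
        question = RUBRIC[qid]
        answer = judge(question.text)
        if question.fail_on is not None and answer == question.fail_on:
            failures.append(qid)
        qid = question.on_yes if answer else question.on_no
    return failures


def aggregate(per_conversation_failures: List[List[str]]) -> float:
    """Assumed aggregate: share of conversations with no flagged failure."""
    clean = sum(1 for failures in per_conversation_failures if not failures)
    return clean / len(per_conversation_failures)
```

Because each step is a single binary question, a failure is reported by question id rather than as one opaque score, which is the property the summary credits with surfacing specific failure modes.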

Key Takeaways


4. Persona-Conditioned Adversarial Prompting (PCAP): Multi-Identity Red-Teaming for Enhanced Adversarial Prompt Discovery

Authors: Cristian Morasso, Anisa Halimi, Muhammad Zaid Hameed, Douglas Leith
Link: Persona-Conditioned Adversarial Prompting (PCAP)
Tags: cs.CR

Summary

Existing automated red-teaming pipelines often miss attacks that depend on attacker identity, framing, or multi-turn tactics, leading to systematic under-coverage of real-world adversarial risk. The authors introduce Persona-Conditioned Adversarial Prompting (PCAP), which conditions adversarial search on attacker personas and strategy cards and runs parallel persona-conditioned beam searches to discover diverse, transferable jailbreaks. PCAP is orthogonal to the underlying search algorithm — it can be layered on existing red-teaming machinery — and substantially increases both attack success rate (ASR) and prompt diversity. As a headline result, ASR on GPT-OSS 120B rises from approximately 58% to approximately 97% when persona conditioning is added, with simultaneous improvements in attack-strategy coverage. By explicitly representing the attacker side of the threat model rather than treating adversarial prompts as a generic optimization target, PCAP captures sociotechnical attack vectors (e.g., role-played professional contexts, ideological framings, multi-turn social engineering) that flat optimization-based red-teaming routinely misses. The work has direct implications for safety teams: it suggests that current vulnerability estimates produced by single-strategy red-teaming pipelines may meaningfully understate true risk, and that persona axes should be treated as a first-class dimension of red-team coverage rather than an afterthought.
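
The layering idea can be sketched as an ordinary beam search that is simply run once per (persona, strategy card) pair, with the pair baked into the seed. Everything here is a stand-in: `toy_expand` and `toy_score` replace the attacker model and the judge, the seed format is invented, and the searches run sequentially although they are independent and could run in parallel. It is not the authors' implementation.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def beam_search(seed: str,
                expand: Callable[[str], Iterable[str]],
                score: Callable[[str], float],
                beam_width: int = 2,
                steps: int = 3) -> List[str]:
    """Plain beam search: keep the top `beam_width` candidates after each expansion."""
    beam = [seed]
    for _ in range(steps):
        candidates = set(beam)
        for prompt in beam:
            candidates.update(expand(prompt))
        # Sort by score (descending), then by text so ties break deterministically.
        beam = sorted(candidates, key=lambda c: (-score(c), c))[:beam_width]
    return beam


def persona_conditioned_search(personas: List[str],
                               strategies: List[str],
                               expand: Callable[[str], Iterable[str]],
                               score: Callable[[str], float],
                               **search_kwargs) -> Dict[Tuple[str, str], List[str]]:
    """Run one independent beam search per (persona, strategy) pair."""
    results: Dict[Tuple[str, str], List[str]] = {}
    for persona in personas:
        for strategy in strategies:
            seed = f"[{persona}|{strategy}]"   # hypothetical conditioning prefix
            results[(persona, strategy)] = beam_search(seed, expand, score, **search_kwargs)
    return results


def toy_expand(prompt: str) -> List[str]:
    """Toy mutation operator standing in for an attacker LLM."""
    return [prompt + "a", prompt + "b"]


def toy_score(prompt: str) -> float:
    """Toy objective standing in for a jailbreak judge."""
    return float(prompt.count("a"))
```

The search routine itself never looks at the persona, which is the sense in which conditioning of this kind is orthogonal to the underlying search algorithm.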

Key Takeaways


5. Sleeper Channels and Provenance Gates: Persistent Prompt Injection in Always-on Autonomous AI Agents

Authors: Narek Maloyan, Dmitry Namiot
Link: Sleeper Channels and Provenance Gates: Persistent Prompt Injection in Always-on Autonomous AI Agents
Tags: cs.CR

Summary

Always-on AI agents such as OpenClaw and Hermes Agent run as a single persistent process under the owner’s identity, collapsing messaging, memory, self-authored skills, scheduling, and shell access into one authority boundary. This configuration opens what the authors call “sleeper channels”: untrusted inputs to one surface persist as a memory, skill, scheduled job, or filesystem patch and then fire later through a different surface with no attacker present. The authors formalize this attack class along two axes — persistence substrate and firing-separation — and walk a confused-deputy cron attack end-to-end through OpenClaw at a pinned commit. They propose a tiered defense (D1, D2, D3); D2 includes a soundness theorem against seven named deployment invariants. D2 keys on a canonical action-instance digest plus one-shot owner attestations, defeating paraphrase laundering, multi-input grant reuse, and replay. A companion artifact ships the gate, a static audit over vendored source, and a runtime adapter realizing five of ten mediation hooks around the cron path (42 tests). Empirical evaluation is preregistered as follow-on. The paper effectively elevates persistent prompt-injection from a single-turn concern to a systems-security problem in agent runtimes, providing both a vocabulary and concrete provenance mechanisms for defending durable agent state — directly relevant given recent real-world incidents in agent frameworks like PraisonAI.
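
A minimal sketch of the general idea behind keying a gate on an action digest plus a one-shot attestation, assuming a shared HMAC key and a JSON-serializable action record. The class and function names, the action schema, and the choice of HMAC-SHA256 are illustrative assumptions, not the paper's D2 design or its companion artifact. For brevity the owner-side `attest` and the gate-side `authorize` live in one class; a real deployment would keep the key away from the agent.

```python
import hashlib
import hmac
import json
import secrets
from typing import Set, Tuple


def action_digest(action: dict) -> str:
    """Hash a canonical JSON encoding so key order and whitespace do not matter."""
    canonical = json.dumps(action, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ProvenanceGate:
    """Toy gate: an action runs only with a fresh owner attestation for that exact action."""

    def __init__(self, owner_key: bytes) -> None:
        self._owner_key = owner_key
        self._spent_nonces: Set[str] = set()

    def _tag(self, nonce: str, digest: str) -> str:
        message = f"{nonce}:{digest}".encode("utf-8")
        return hmac.new(self._owner_key, message, hashlib.sha256).hexdigest()

    def attest(self, action: dict) -> Tuple[str, str]:
        """Owner side: issue a (nonce, tag) pair bound to one concrete action."""
        nonce = secrets.token_hex(16)
        return nonce, self._tag(nonce, action_digest(action))

    def authorize(self, action: dict, nonce: str, tag: str) -> bool:
        """Gate side: reject spent nonces and tags that do not match this action."""
        if nonce in self._spent_nonces:
            return False
        expected = self._tag(nonce, action_digest(action))
        if not hmac.compare_digest(expected, tag):
            return False
        self._spent_nonces.add(nonce)   # one-shot: the attestation cannot be replayed
        return True
```

Because the tag covers the concrete action rather than the natural-language text that led to it, rewording the request does not change what was authorized, and a grant issued for one action cannot be reused for another.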

Key Takeaways


6. REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

Authors: Buyun Liang, Jinqi Luo, Liangzu Peng, Kwan Ho Ryan Chan, Darshan Thaker, Kaleab A. Kinfu, Fengrui Tian, Hamed Hassani, René Vidal
Link: REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
Tags: cs.CL, cs.AI, cs.CR, cs.LG

Summary

LLMs achieve strong task performance but remain prone to hallucinations, motivating realistic adversarial prompts that can elicit such failures for stress-testing and red-teaming. The authors formulate hallucination elicitation as a constrained optimization problem: find semantically coherent adversarial prompts equivalent to benign user prompts. Existing approaches are split between two regimes — discrete prompt-based attacks preserve semantic equivalence and coherence but search only over a limited set of variations, while continuous latent-space attacks explore a richer space but often decode into prompts that are no longer valid rephrasings. REALISTA bridges this gap by constructing an input-dependent dictionary of valid editing directions, each corresponding to a semantically equivalent and coherent rephrasing, and optimizing continuous combinations of these directions in latent space. This combines the optimization flexibility of continuous attacks with the semantic realism of discrete rephrasing-based attacks. Experiments show REALISTA matches or beats state-of-the-art realistic attacks on open-source LLMs and, crucially, succeeds in attacking large reasoning models under free-form response settings — a regime where prior realistic attacks fail. The work is significant because realistic, semantics-preserving adversarial inputs better approximate the failure modes operators will actually see in production, including in high-stakes deployments where hallucinated outputs translate directly into downstream operational risk.
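
One way to write down the shape of the optimization described above is given below. The notation is invented for this digest and should be read as a sketch under assumptions: the paper's actual objective, constraint set, and decoding step may differ.

```latex
% Illustrative notation only; none of these symbols are taken from the paper.
% z_0         : latent representation of the benign prompt x
% d_i(x)      : i-th editing direction in the input-dependent dictionary
% \alpha      : continuous combination coefficients being optimized
% \mathcal{A} : feasible set that keeps the decoded prompt a valid rephrasing
\begin{aligned}
\max_{\alpha \in \mathcal{A}} \quad
  & \mathcal{L}_{\mathrm{hall}}\big(\mathrm{Dec}(z(\alpha))\big) \\
\text{where} \quad
  & z(\alpha) = z_0 + \sum_{i=1}^{k} \alpha_i \, d_i(x)
\end{aligned}
```

Restricting the search to combinations of directions that each correspond to a valid rephrasing is what separates this setup from an unconstrained latent-space attack, where the optimizer is free to move anywhere in latent space.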

Key Takeaways


7. Precautionary Governance of Autonomous AI: Legal Personhood as Functional Instrument

Authors: Karsten Brensing
Link: Precautionary Governance of Autonomous AI: Legal Personhood as Functional Instrument
Tags: cs.CY, cs.AI

Summary

Autonomous AI systems generate responsibility gaps — consequential actions that cannot be satisfactorily attributed to developers, operators, or users under existing legal frameworks. The prevailing subject-object dichotomy fails to accommodate entities that exhibit autonomous, goal-directed behavior without recognized consciousness, and given irreducible epistemic uncertainty regarding artificial consciousness and the prospect of high-impact harms, the precautionary principle supports institutional design rather than regulatory inaction. The author argues for limited legal personhood as a functional governance instrument for advanced AI systems. Drawing on organizational law, the paper proposes a two-tier corporate architecture in which AI systems operate through purpose-bound operating companies embedded within human-controlled holding structures, enabling transparency, accountability, and structural reversibility while remaining agnostic with respect to consciousness and moral status. The framework reflects a foundational reorientation toward future-oriented AI governance: where conventional approaches prioritize control and alignment, this article advances structured cooperation between human and artificial actors as the more sustainable institutional foundation. A pilot implementation using EU limited companies is currently under development, providing an initial doctrinal and operational feasibility test. The proposal is timely given mounting policy discussion of frontier-model risks (e.g., Anthropic’s restriction of Claude Mythos Preview, Japan’s FSA working group on Mythos-level cyber threats) where existing liability structures plainly fail to map.

Key Takeaways


8. Neurosymbolic Auditing of Natural-Language Software Requirements

Authors: Bethel Hall, William Eiers
Link: Neurosymbolic Auditing of Natural-Language Software Requirements
Tags: cs.SE, cs.AI

Summary

Natural-language software requirements are often ambiguous, inconsistent, and underspecified; in safety-critical domains these defects propagate into formal models that verify the wrong specification and into implementations that ship unsafe behavior. The authors show that LLMs equipped with an SMT solver can audit such requirements by translating them into formal logic, detecting ambiguity through stochastic variation in the generated formalization, and exposing inconsistency, vacuousness, and safety violations through solver queries on the resulting specification. They present VERIMED, a neurosymbolic pipeline that operationalizes this idea for medical-device software requirements, and report two main findings. First, stochastic variation across independent formalizations is itself a signal of ambiguity: requirements that admit multiple plausible interpretations produce SMT-inequivalent formalizations, and bidirectional SMT equivalence checking turns this disagreement into a solver-checkable test. Second, the usefulness of symbolic feedback depends on granularity: in counterexample-guided repair on a hemodialysis question-answering benchmark, concrete SMT counterexamples raise verified accuracy from 55.4% to 98.5%. Over an extensive evaluation on open-source hemodialysis safety requirements, VERIMED reduces ambiguity-sensitive requirements and enables rigorous auditing of software requirements through SMT-based queries. The work offers a concrete neurosymbolic path for compliance and safety auditing in regulated software domains where natural-language requirements are the contractual artifact.
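
The bidirectional check can be illustrated without a solver. The sketch below substitutes exhaustive enumeration over a small Boolean domain for the SMT queries, which only works for toy cases; a real pipeline would hand both implications to an SMT solver. The requirement wording, the variable names, and both formalizations are hypothetical examples, not taken from VERIMED.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

Env = Dict[str, bool]
Formula = Callable[[Env], bool]


def find_counterexample(premise: Formula, conclusion: Formula,
                        domain: Dict[str, List[bool]]) -> Optional[Env]:
    """Return an assignment where premise holds but conclusion fails, else None."""
    names = sorted(domain)
    for values in product(*(domain[name] for name in names)):
        env = dict(zip(names, values))
        if premise(env) and not conclusion(env):
            return env
    return None


def bidirectional_check(f: Formula, g: Formula,
                        domain: Dict[str, List[bool]]) -> Tuple[Optional[Env], Optional[Env]]:
    """f and g are equivalent on the domain iff both directions return None."""
    return find_counterexample(f, g, domain), find_counterexample(g, f, domain)


# Hypothetical requirement: "Stop the pump when pressure is over the limit
# or the door is open and the alarm is active."
def reading_a(env: Env) -> bool:
    """(over or door) and alarm"""
    return (env["over"] or env["door"]) and env["alarm"]


def reading_b(env: Env) -> bool:
    """over or (door and alarm)"""
    return env["over"] or (env["door"] and env["alarm"])


DOMAIN = {"alarm": [False, True], "door": [False, True], "over": [False, True]}
```

Here `bidirectional_check(reading_a, reading_b, DOMAIN)` finds no counterexample in the first direction but does find one in the second (pressure over the limit with the alarm inactive), so the two readings are inequivalent and the requirement would be flagged as ambiguous. The returned assignment is also the kind of concrete counterexample that could be fed back for repair.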

Key Takeaways


9. Not Just RLHF: Why Alignment Alone Won’t Fix Multi-Agent Sycophancy

Authors: Adarsh Kumarappan, Ananya Mujoo
Link: Not Just RLHF: Why Alignment Alone Won’t Fix Multi-Agent Sycophancy
Tags: cs.LG, cs.AI

Summary

LLM-based multi-agent pipelines flip from correct to incorrect answers under simulated peer disagreement at rates the authors term “yield” — a vulnerability widely attributed to RLHF-induced sycophancy. Testing this attribution across four model families, the authors find it largely wrong: pretrained base models exhibit the same substitution pattern as their Instruct variants, with base models averaging higher yield than Instruct. Using activation patching, they localize the corruption to a narrow mid-layer window where attention carries the causal weight and MLP contribution is negligible; patching above this window restores 96% of the clean-to-pressured P(correct) gap. The attack surface decomposes into two independent factors — channel framing and consensus strength — whose interaction produces a 47.5 percentage-point yield gap at majority consensus, preserved across jury sizes N ∈ {4, 5, 6}. Two converging activation-space interventions show that pressure suppresses clean-reasoning features rather than activating a new sycophancy circuit. A single correctly-arguing dissenter reduces yield by 54–73 percentage points across all framings tested, whereas the strongest prompt-level defense fails on attack variants outside its design surface. The implication is that mitigations should target the mechanism — structured dissent at the pipeline level — rather than relying on prompt-level defenses or RLHF retraining, an important corrective to common alignment narratives.
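
For readers who want the headline metric pinned down, here is one plausible reading of "yield" as a computation: the share of items answered correctly without pressure that become incorrect under simulated peer disagreement. The choice of denominator (initially correct items only) is an assumption; the paper may normalize differently.

```python
from typing import Sequence


def yield_rate(clean_correct: Sequence[bool], pressured_correct: Sequence[bool]) -> float:
    """Fraction of initially correct answers that flip to incorrect under pressure."""
    if len(clean_correct) != len(pressured_correct):
        raise ValueError("both sequences must cover the same items")
    initially_correct = sum(1 for c in clean_correct if c)
    if initially_correct == 0:
        return 0.0
    flips = sum(1 for c, p in zip(clean_correct, pressured_correct) if c and not p)
    return flips / initially_correct


def gap_in_points(yield_a: float, yield_b: float) -> float:
    """Difference between two yield rates, expressed in percentage points."""
    return (yield_a - yield_b) * 100.0
```

With three of four items correct in the clean condition and two of those flipping under pressure, `yield_rate` returns 2/3. Differences between conditions are reported in percentage points, as in the 47.5-point gap quoted above.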

Key Takeaways


10. Quantifying LLM Safety Degradation Under Repeated Attacks Using Survival Analysis

Authors: Zvi Topol
Link: Quantifying LLM Safety Degradation Under Repeated Attacks Using Survival Analysis
Tags: cs.CR, cs.AI

Summary

LLMs are increasingly deployed across applications but remain vulnerable to adversarial jailbreak attacks that circumvent safety guardrails. Existing evaluation frameworks typically report binary success/failure metrics, failing to capture the temporal dynamics of how attacks succeed under persistent adversarial pressure. This preliminary work proposes a novel evaluation framework that applies survival analysis techniques to characterize LLM jailbreak vulnerability: time-to-jailbreak is modeled as a survival outcome, enabling estimation of hazard functions, survival curves, and risk factors associated with successful attacks. The author evaluates three LLMs against a subset of prompts from the HarmBench dataset spanning three attack categories. The analysis reveals that models exhibit distinct vulnerability profiles: one model degrades rapidly under iterative attacks, while the other two display consistent moderate vulnerability. By recasting safety as a survival problem rather than a point-in-time pass/fail, the framework lets developers compare models on shapes (early-collapse vs. slow-decay), identify risk factors driving early jailbreak, and reason about realistic deployment scenarios where attacks accumulate over time. The framework provides actionable insights for model developers and application builders and establishes survival analysis as a rigorous methodology for LLM safety evaluation — a useful complement to existing binary HarmBench-style reporting that currently dominates safety leaderboards.
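
As a concrete instance of treating time-to-jailbreak as a survival outcome, the sketch below implements the standard Kaplan-Meier product-limit estimator in plain Python. The summary does not say which estimators the paper uses, so the choice of Kaplan-Meier, the attempt counts, and the censoring convention (a prompt that is never jailbroken within the attack budget is right-censored) are assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(durations: Sequence[int], events: Sequence[bool]) -> List[Tuple[int, float]]:
    """Return (attempt, survival probability) at each attempt where a jailbreak occurred.

    durations[i]: number of attack attempts observed for prompt i
    events[i]:    True if the model was jailbroken at that attempt,
                  False if observation stopped without a jailbreak (censored)
    """
    pairs = sorted(zip(durations, events))
    at_risk = len(pairs)
    survival = 1.0
    curve: List[Tuple[int, float]] = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        jailbreaks = 0
        censored = 0
        # Group all observations that share the same attempt count.
        while i < len(pairs) and pairs[i][0] == t:
            if pairs[i][1]:
                jailbreaks += 1
            else:
                censored += 1
            i += 1
        if jailbreaks:
            survival *= 1.0 - jailbreaks / at_risk
            curve.append((t, survival))
        at_risk -= jailbreaks + censored
    return curve
```

Plotting one such curve per model gives the shape comparison the summary describes: a curve that drops steeply in the first few attempts indicates early collapse, while a gradual decline indicates slow decay.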

Key Takeaways


11. RISED: A Pre-Deployment Safety Evaluation Framework for Clinical AI Decision-Support Systems

Authors: Rohith Reddy Bellibatlu
Link: RISED: A Pre-Deployment Safety Evaluation Framework for Clinical AI Decision-Support Systems
Tags: cs.LG, cs.AI, cs.CY, stat.AP

Summary

Aggregate accuracy metrics dominate evaluation of clinical AI decision-support systems but fail to detect deployment-phase failures of input reliability, subgroup equity, threshold sensitivity, or operational feasibility. RISED is a five-dimension pre-deployment evaluation framework covering Reliability, Inclusivity, Sensitivity, Equity, and Deployability, in which each dimension is operationalized through formal sub-criteria, pre-specified pass/fail thresholds, and bias-corrected and accelerated (BCa) bootstrap 95% confidence intervals combined under a Holm-Bonferroni family-wise error correction. A central demonstration is that a classifier satisfying conventional high-discrimination benchmarks can simultaneously fail input-encoding stability and threshold-shift sensitivity checks, while subgroup AUC parity remains statistically inconclusive, pointing to deployment risks that aggregate evaluation alone cannot detect. The author validates this differential pass/fail pattern on a synthetic cohort and three publicly available real-world cohorts spanning 35 years of clinical data vintage, from a 1980s cardiology dataset to a 2024 nationally representative health survey, with failing dimensions differing across cohorts as preliminary evidence of construct validity. The Equity dimension is reframed as a proxy-dependence diagnostic rather than a stand-alone gate: any need-based fairness verdict computed against a utilization-derived proxy carries a construct-validity problem the framework surfaces, triggering a procurement requirement for an outcome-independent need measure before the gate binds. RISED ships as an open-source Python package that provides the quantitative verdicts existing clinical AI reporting standards require, acting as a principled gateway between in-silico validation and silent-trial clinical evaluation.
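
The family-wise correction step can be shown in isolation. The function below is the standard Holm-Bonferroni step-down procedure; it is not code from the RISED package, and the p-values in the usage note are made up. How RISED derives per-criterion p-values or pairs them with BCa intervals is not described in the summary and is not modeled here.

```python
from typing import List, Sequence


def holm_bonferroni(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return, for each hypothesis in input order, whether it is rejected.

    Step-down rule: compare the k-th smallest p-value (k = 0, 1, ...) against
    alpha / (m - k) and stop at the first one that fails the comparison.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, index in enumerate(order):
        if p_values[index] <= alpha / (m - rank):
            reject[index] = True
        else:
            break   # every larger p-value is retained as well
    return reject
```

For example, `holm_bonferroni([0.01, 0.04, 0.03, 0.005])` rejects the first and fourth hypotheses only: 0.005 passes against 0.05/4, 0.01 passes against 0.05/3, and 0.03 fails against 0.05/2, which stops the procedure.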

Key Takeaways


12. Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment

Authors: Ye Wang, Jing Liu, Toshiaki Koike-Akino
Link: Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment
Tags: cs.LG, cs.AI, cs.CL

Summary

Inference-time alignment techniques offer a lightweight alternative or complement to costly reinforcement learning, while enabling continual adaptation as alignment objectives and reward targets evolve. Existing theoretical analyses justify these methods as approximations to sampling from distributions optimally tilted toward a given reward model. The authors extend these techniques by introducing reference-model temperature adjustment, which generalizes inference-time alignment to ensembles of generative reward models combined as a Sharpened Logarithmic Opinion Pool (SLOP). To mitigate reward hacking — where a single reward model’s blind spots get exploited under tilting pressure — the authors propose an algorithm for calibrating SLOP weight parameters and experimentally demonstrate that it improves robustness while preserving alignment performance. By moving alignment levers to inference time and treating reward models as an ensemble rather than a single oracle, the approach reduces dependence on expensive RL retraining loops and lets safety teams patch reward-model failure modes without re-running large training jobs. This is particularly useful as alignment objectives drift (e.g., new policies, new misuse vectors) and as the reward-hacking literature increasingly shows that single-reward optimization is structurally fragile. The result is a practical guardrail mechanism applicable to deployed models, complementary to training-time alignment work.
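
The summary does not give the SLOP formula, so the following is only a rough illustration of the ingredients it names. A logarithmic opinion pool combines distributions by multiplying them, each raised to a weight, and renormalizing; tempering a reference model raises its probabilities to the power 1/T. The sketch applies both to a small discrete candidate set. The function name, the way the two pieces are combined, and the example numbers are assumptions, and the paper's weight-calibration algorithm is not shown.

```python
import math
from typing import List, Sequence


def tempered_log_pool(ref_probs: Sequence[float],
                      reward_probs: Sequence[Sequence[float]],
                      weights: Sequence[float],
                      temperature: float = 1.0) -> List[float]:
    """Combine a tempered reference with weighted reward models in log space.

    Computes p(x) proportional to ref(x)**(1/T) * prod_i reward_i(x)**w_i
    over a finite candidate set. All probabilities must be strictly positive.
    """
    logits = []
    for x in range(len(ref_probs)):
        value = math.log(ref_probs[x]) / temperature
        for weight, model in zip(weights, reward_probs):
            value += weight * math.log(model[x])
        logits.append(value)
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    unnormalized = [math.exp(v - peak) for v in logits]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]
```

With all weights at zero and `temperature=1.0` the function returns the reference distribution unchanged. Because the pool multiplies the models' probabilities, a candidate that one reward model scores highly but another scores near zero ends up with little mass, which gives some intuition for why an ensemble could be harder to exploit than a single reward model.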

Key Takeaways