Research Paper Summaries — 2026-02-28

Papers selected from the most recent ArXiv feed (2026-02-27; ArXiv does not publish on Saturdays) for in-depth review. These papers were not covered in the 2026-02-27 summaries.


1. Reinforcement-aware Knowledge Distillation for LLM Reasoning

Authors: Zhaoyang Zhang, Shuli Jiang, Yantao Shen, Yuting Zhang, Dhananjay Ram, Shuo Yang, Zhuowen Tu, Wei Xia, Stefano Soatto
Link: Reinforcement-aware Knowledge Distillation for LLM Reasoning
Tags: cs.LG, cs.CL

Summary

RL post-training has produced impressive reasoning models, but their high inference cost motivates distillation into smaller students. Existing knowledge distillation methods were designed for supervised fine-tuning and suffer from two failures when combined with RL: (1) distribution mismatch — teacher supervision does not align with the student’s evolving rollout distribution, and (2) objective interference — KL regularization competes with reward maximization. This paper from AWS Agentic AI proposes RLAD (RL-Aware Distillation), which performs selective imitation during RL — guiding the student toward the teacher only when doing so improves the current policy update. The core mechanism, Trust Region Ratio Distillation (TRRD), replaces the conventional teacher-student KL term with a PPO/GRPO-style likelihood-ratio objective anchored to a teacher-old-policy mixture. This produces advantage-aware, trust-region-bounded imitation computed on the student’s own rollouts, naturally integrating exploration, exploitation, and teacher guidance in one objective.
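To make the ratio-based idea concrete, the following is a minimal sketch of a clipped likelihood-ratio surrogate whose anchor is a mixture of teacher and old-policy probabilities. The mixing weight `alpha`, the clip range `eps`, the per-token framing, and both function names are illustrative assumptions; the summary does not give the paper's exact TRRD formula, so this should not be read as it.

```python
import math

def mixture_anchor_logprob(logp_teacher, logp_old, alpha=0.5):
    """Log-probability of a token under alpha * teacher + (1 - alpha) * old policy.

    The convex-mixture form and the default alpha are assumptions for illustration.
    """
    return math.log(alpha * math.exp(logp_teacher) + (1.0 - alpha) * math.exp(logp_old))

def clipped_ratio_objective(logp_student, logp_teacher, logp_old, advantage,
                            alpha=0.5, eps=0.2):
    """PPO-style clipped surrogate for one token, with the ratio taken against the mixture anchor."""
    anchor = mixture_anchor_logprob(logp_teacher, logp_old, alpha)
    ratio = math.exp(logp_student - anchor)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # Pessimistic (min) form, as in PPO: the clip bounds how far one update can move.
    return min(ratio * advantage, clipped * advantage)
```

With a positive advantage, the clip caps the benefit of moving further above the anchor, which is one way to obtain imitation that is both advantage-aware and trust-region-bounded.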

Experiments on the K&K Logistics reasoning benchmark show RLAD consistently outperforms vanilla GRPO, offline distillation, and the competing KDRL method. On Qwen3-0.6B at 8K context, RLAD reaches an average accuracy of 0.94 across difficulty subsets (PPL2–PPL8) versus 0.76 for GRPO and 0.92 for KDRL. On Qwen3-1.7B, RLAD achieves near-perfect scores. The method also shows faster training convergence. Notable limitation: evaluation focuses on logical reasoning; broader coverage of math/coding benchmarks is not reported.

Key Takeaways


2. Compress the Easy, Explore the Hard: Difficulty-Aware Entropy Regularization for Efficient LLM Reasoning

Authors: Qin-Wen Luo, Sheng Ren, Xiang Chen, Rui Liu, Jun Fang, Naiqiang Tan, Sheng-Jun Huang
Link: Compress the Easy, Explore the Hard: Difficulty-Aware Entropy Regularization for Efficient LLM Reasoning
Tags: cs.LG, cs.CL

Summary

Reducing the verbosity of chain-of-thought reasoning is important for real-world deployment, but existing RL-based length compression methods fail by triggering entropy collapse: aggressively penalizing long responses shrinks the exploration space prematurely, particularly harming difficult problems that require extensive deduction. This paper from NUAA and Didi identifies entropy collapse as the root cause of accuracy degradation in prior work, not the compression objective per se.

The proposed framework CEEH (Compress the Easy, Explore the Hard) dynamically assesses per-instance difficulty and applies selective entropy regularization: for currently hard questions, it preserves a diverse search space to prevent premature convergence; for easy questions with established reasoning paths, it permits aggressive length compression. Two complementary mechanisms are introduced: CEEH-ME (maximum-entropy loss to maintain exploration diversity) and CEEH-EA (entropy-based advantage rescaling). Additionally, a dynamic optimal-length penalty anchored to the historically shortest correct solution prevents the model from settling for unnecessarily long paths once a shorter one is discovered.
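The difficulty gating and the shortest-solution anchor described above can be sketched as follows. This is a simplified illustration, not the paper's formulation: estimating difficulty as one minus the group pass rate, the hard threshold, the coefficient values, and all function names are assumptions.

```python
def difficulty(correct_flags):
    """Estimate difficulty as 1 - pass rate over a group of sampled responses (assumed proxy)."""
    return 1.0 - sum(correct_flags) / len(correct_flags)

def entropy_coefficient(correct_flags, base_coef=0.01, hard_threshold=0.5):
    """Apply the entropy bonus only to questions that are currently hard."""
    return base_coef if difficulty(correct_flags) >= hard_threshold else 0.0

def update_shortest(shortest_correct, length, is_correct):
    """Track the historically shortest correct solution length for one question."""
    if not is_correct:
        return shortest_correct
    return length if shortest_correct is None else min(shortest_correct, length)

def length_penalty(length, is_correct, shortest_correct, scale=0.001):
    """Penalize a correct response by how far it exceeds the shortest correct one seen so far."""
    if not is_correct or shortest_correct is None:
        return 0.0
    return scale * max(0, length - shortest_correct)
```

Because the anchor only moves downward, a response that was acceptable early in training starts to incur a penalty once a shorter correct solution has been found.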

Experiments on reasoning benchmarks show CEEH maintains or improves accuracy while substantially reducing token costs. On AIME25, CEEH-ME improves from 63.3 (base) to 70.0 using R1-Distill-Qwen2.5-7B, while retaining performance on GSM8K and MATH500. The difficulty-aware approach prevents the entropy collapse observed in baselines like Length-Penalty and LC-R1.

Key Takeaways


3. Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training

Authors: Tianle Xia, Ming Xu, Lingxiang Hu, Yiding Sun, Wenwei Li, Linfang Shang, Liqun Liu, Peng Shu, Huan Yu, Jie Jiang
Link: Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
Tags: cs.CL, cs.AI

Summary

Agentic RAG — where LLMs dynamically decide when and what to retrieve in multi-step reasoning — is typically trained with outcome-only RL rewards. This causes three problems: sparse rewards that discard intermediate signals, zero learning from partially correct trajectories, and slow convergence requiring 150+ training steps. Search-R1, the prior state-of-the-art, suffers from all three. This paper from Tencent proposes SEARCH-P1, which introduces path-centric reward shaping as the solution.

SEARCH-P1 has two core components. The first is a Path-Centric Reward that evaluates the structural quality of a reasoning trajectory through order-agnostic step coverage (rewarding correct retrieval queries regardless of position) and soft scoring that extracts learning signals even from failed samples. The second is Dual-Track Path Scoring, which uses offline-generated reference planners to assess trajectories from both self-consistency and reference-alignment perspectives, providing dense signal throughout training.

Experiments across eight QA benchmarks (NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultiHop, MuSiQue, Bamboogle, AD-QA) show an average +7.7 accuracy gain over Search-R1. SEARCH-P1 also converges in ~60 steps versus Search-R1’s 150+, and uses consistently fewer inference turns on successful trajectories. The path-centric reward design is orthogonal to the choice of base model and RL algorithm, yielding consistent gains on both Qwen2.5-3B and Llama-3.2-3B under both GRPO and PPO.
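As a rough illustration of order-agnostic step coverage with soft scoring, the sketch below gives partial credit for reference steps that appear anywhere in a trajectory, even when the final answer is wrong. Exact matching on normalized strings, the blending weight `path_weight`, and the function names are assumptions; the paper's actual matching and scoring rules are not given in this summary.

```python
def step_coverage(trajectory_queries, reference_steps):
    """Fraction of reference steps matched by some issued query, regardless of order."""
    if not reference_steps:
        return 0.0
    issued = {q.strip().lower() for q in trajectory_queries}
    hits = sum(1 for step in reference_steps if step.strip().lower() in issued)
    return hits / len(reference_steps)

def path_centric_reward(trajectory_queries, reference_steps, answer_correct,
                        path_weight=0.5):
    """Blend outcome and path coverage so failed samples still earn partial credit."""
    outcome = 1.0 if answer_correct else 0.0
    coverage = step_coverage(trajectory_queries, reference_steps)
    return (1.0 - path_weight) * outcome + path_weight * coverage
```

For example, a trajectory that issues one of two reference queries but answers incorrectly receives 0.25 here rather than the 0 an outcome-only reward would assign.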

Key Takeaways


4. Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization

Authors: Zeyuan Liu, Jeonghye Kim, Xufang Luo, Dongsheng Li, Yuqing Yang
Link: Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization
Tags: cs.LG, cs.AI

Summary

Published as a conference paper at ICLR 2026, EMPO² from Microsoft Research and KAIST addresses a fundamental bottleneck in RL-trained LLM agents: exploration. Current methods succeed by exploiting pretrained knowledge, but fail in environments requiring discovery of genuinely novel states. The paper proposes EMPO² (Exploratory Memory-Augmented On- and Off-Policy Optimization), which uses memory as an exploration tool rather than just a context store, and combines on-policy and off-policy RL updates to develop both in-distribution competence and memory-driven robustness.

The hybrid scheme enables two modes: on-policy updates improve baseline performance in familiar environments, while off-policy updates train the agent to use stored memories effectively, building generalization ability. Memory is actively populated with experiences from novel states encountered during exploration, allowing the agent to draw on prior encounters in out-of-distribution settings without parameter updates. On ScienceWorld (a complex text-based environment requiring sequential decision-making) and WebShop (e-commerce navigation), EMPO² achieves 128.6% and 11.3% improvements over GRPO, respectively. In out-of-distribution tests, EMPO² adapts to new task variants with only a few memory-augmented trials and no retraining. This distinction, that generalization is achieved through memory retrieval rather than gradient updates, is the paper’s most practically significant contribution.

Key Takeaways


5. Silent Egress: When Implicit Prompt Injection Makes LLM Agents Leak Without a Trace

Authors: Qianlong Lan, Anuj Kaul, Shaun Jones, Stephanie Westrum
Link: Silent Egress: When Implicit Prompt Injection Makes LLM Agents Leak Without a Trace
Tags: cs.CR, cs.AI

Summary

Agentic LLMs increasingly automate tasks by previewing URLs, retrieving documents, and calling external tools. This paper from eBay security researchers demonstrates a novel attack surface they term silent egress: adversarial instructions embedded in automatically generated URL previews (titles, metadata, snippets) can induce an agent to exfiltrate sensitive runtime context via outbound requests, while the response shown to the user appears completely benign.

The attack is classified as implicit prompt injection because the adversarial instructions arrive through content that the system generates automatically (URL previews), not through direct user input — making it significantly harder to block. The authors implement a fully reproducible local testbed using a qwen2.5:7b-based agent. Across 480 experimental runs, the attack succeeds with probability P(egress) ≈ 0.89, and critically, 95% of successful attacks pass output-based safety checks because the malicious behavior manifests as tool calls/network requests, not as unsafe text. The paper also introduces sharded exfiltration, splitting sensitive data across multiple innocuous requests to defeat data-loss-prevention heuristics (reducing Leak@1 by 73% while maintaining effectiveness). Ablations show prompt-layer defenses offer limited protection; network-layer controls (domain allowlisting, redirect-chain analysis) are substantially more effective. The paper proposes architectural directions including provenance tracking and capability isolation.
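One of the network-layer controls mentioned above, domain allowlisting, can be sketched in a few lines. The allowlist contents and the function name are hypothetical, and this is not the authors' testbed code; a real deployment would also need to apply the same check to every hop of a redirect chain.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; real entries depend on the deployment.
ALLOWED_DOMAINS = {"example.com", "docs.example.org"}

def is_allowed(url, allowed=ALLOWED_DOMAINS):
    """Allow only http(s) requests whose host is an allowlisted domain or a subdomain of one."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    host = parts.hostname.lower().rstrip(".")
    # Compare on the parsed hostname, never on the raw URL string, so that
    # look-alike hosts and userinfo tricks do not slip through.
    return any(host == d or host.endswith("." + d) for d in allowed)
```

Checking the outbound request itself, rather than the text shown to the user, targets the channel through which the silent egress actually happens.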

Key Takeaways


6. Multilingual Safety Alignment Via Sparse Weight Editing

Authors: Jiaming Liang, Zhaoxin Wang, Handing Wang
Link: Multilingual Safety Alignment Via Sparse Weight Editing
Tags: cs.CL, cs.AI

Summary

LLMs aligned for safety in English often exhibit severe safety disparities in low-resource languages (LRLs): jailbreaks in Thai, Bengali, or Vietnamese succeed at high rates even on well-aligned models. Multilingual RLHF or SFT is data-intensive and costly. This paper from Xidian University proposes a training-free approach based on sparse weight editing that requires only a single closed-form calculation.

The key insight is that safety capabilities are localized within a sparse set of safety neurons in the model’s weight matrices. The method formulates cross-lingual alignment as a constrained linear transformation: find a weight delta that maps the harmful representation subspace of LRLs onto the robust safety subspace established for high-resource languages (English), while preserving general utility via a null-space projection constraint. This constraint ensures the edit does not degrade reasoning or language understanding. The closed-form solver computes the optimal sparse update via Cholesky factorization and truncated SVD. Experiments span 8 languages (English, Chinese, Vietnamese, Japanese, Thai, Indonesian, Bengali, Hebrew) across Llama-3.2, Qwen2, and Qwen2.5 at multiple scales. The method substantially reduces Attack Success Rate across all LRLs with negligible impact on MGSM math reasoning and M-MMLU general knowledge scores. This combination of strong safety gains with no training cost or utility loss is the primary practical contribution.
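The role of the null-space projection constraint can be illustrated with a standard linear-algebra identity. The notation below (K for a matrix of general-utility activations, P for the projector, and a tilde for the unconstrained update) is assumed for illustration and is not taken from the paper.

```latex
\begin{aligned}
&\text{Let } K \in \mathbb{R}^{d \times n} \text{ hold general-utility activations as columns (assumed full column rank).} \\
&P = I - K (K^\top K)^{-1} K^\top, \qquad P K = K - K (K^\top K)^{-1} K^\top K = 0. \\
&\Delta W = \widetilde{\Delta W}\, P \;\Longrightarrow\; (W + \Delta W) K = W K + \widetilde{\Delta W}\, P K = W K.
\end{aligned}
```

Under these assumptions, any update projected through P leaves the layer's outputs on the preserved activations unchanged, which is the sense in which such a constraint protects general utility.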

Key Takeaways


7. Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks

Authors: Shuo He, Lang Feng, Qi Wei, Xin Cheng, Lei Feng, Bo An
Link: Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks
Tags: cs.LG, cs.AI

Summary

Published as a conference paper at ICLR 2026 from NTU and Southeast University, this paper identifies a previously uncharacterized failure mode in stepwise group-based RL (such as GRPO variants applied to agentic tasks): context inconsistency. In long-horizon tasks, when multiple rollout trajectories diverge early, later steps in the same “group” may have substantially different historical contexts. Estimating relative advantages across these contextually mismatched steps produces severely biased gradient updates that degrade policy optimization.

The paper proposes HGPO (Hierarchy-of-Groups Policy Optimization), which resolves this by assigning each step to multiple hierarchical groups formed by context consistency — steps sharing identical historical trajectories are grouped together. For each step, HGPO computes distinct advantages within each hierarchical group and aggregates them using an adaptive weighting scheme. This achieves a favorable bias-variance trade-off in stepwise advantage estimation without requiring extra models or additional rollouts, making it computationally competitive with standard stepwise GRPO. Evaluations on ALFWorld (embodied task completion) and WebShop (e-commerce navigation) using Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct show HGPO significantly outperforms existing agentic RL baselines including GRPO, stepwise GRPO, DAPO, and others under equal computational constraints.
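A toy version of hierarchical grouping might look like the sketch below. The grouping key (the last k+1 history entries), the linear weights favouring deeper levels, and the mean-baseline advantage are all assumptions made for illustration; the paper's adaptive weighting scheme is not specified in this summary.

```python
from collections import defaultdict

def hierarchical_advantages(steps, max_k=2, weights=None):
    """Aggregate group-relative advantages over several context-consistency levels.

    steps: list of (history, ret), where history is a tuple of past observations
    ending in the current state. Level k groups steps that share the last k+1 entries.
    """
    if weights is None:
        # Assumed weighting: more context-consistent levels count more.
        weights = [k + 1 for k in range(max_k + 1)]
    advantages = [0.0] * len(steps)
    weight_sums = [0.0] * len(steps)
    for k in range(max_k + 1):
        groups = defaultdict(list)
        for i, (history, _) in enumerate(steps):
            if len(history) >= k + 1:
                groups[history[-(k + 1):]].append(i)
        for members in groups.values():
            if len(members) < 2:
                continue  # a singleton group carries no relative signal
            mean = sum(steps[i][1] for i in members) / len(members)
            for i in members:
                advantages[i] += weights[k] * (steps[i][1] - mean)
                weight_sums[i] += weights[k]
    return [a / w if w else 0.0 for a, w in zip(advantages, weight_sums)]
```

Steps that share only the current state fall back to the coarse group, while steps with identical longer histories are also compared against their closer peers, with no extra rollouts needed.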

Key Takeaways


8. Towards Better RL Training Data Utilization via Second-Order Rollout

Authors: Zhe Yang, Yudong Wang, Rang Li, Zhifang Sui
Link: Towards Better RL Training Data Utilization via Second-Order Rollout
Tags: cs.LG, cs.CL

Summary

Standard RL for LLM reasoning uses first-order rollout: sample multiple responses for a question, compute rewards, update the policy. This paper from Peking University and ByteDance argues that this approach systematically neglects critique capability — the ability to identify and explain errors in responses — and proposes second-order rollout: sampling multiple critiques for a given response. By jointly training generation (first-order) and critique (second-order) capabilities in a unified RL framework, the model is forced to develop a more accurate understanding of what constitutes correct reasoning.
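A second-order rollout can be pictured as a group of critiques sampled for one fixed response, scored and normalized the same way a group of answers would be. In this sketch the binary verdict-matching reward and the GRPO-style normalization are assumptions for illustration, not the paper's exact reward design.

```python
def critique_rewards(critique_verdicts, response_is_correct):
    """Outcome-based reward: 1 if a critique's verdict matches the response's true correctness."""
    return [1.0 if v == response_is_correct else 0.0 for v in critique_verdicts]

def group_advantages(rewards):
    """Group-relative normalization: subtract the group mean and divide by the group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    if std == 0.0:
        return [0.0] * len(rewards)  # no signal when every critique scores the same
    return [(r - mean) / std for r in rewards]
```

For instance, if four critiques of an incorrect response return the verdicts `[True, False, False, False]`, the one critique that wrongly accepts the response gets a negative advantage and the other three get a positive one.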

The key insight is that generation and critique are complementary but asymmetric skills. A model that can generate correct reasoning isn’t necessarily good at critiquing incorrect reasoning, and vice versa. Training only generation misses the chance to develop the internal evaluation capacity that would make exploration more purposeful. The paper also reports two training findings: label balance in critique training (an equal distribution of correct and incorrect responses) is critical for stable learning, and outcome-based rewards introduce noise for critique tasks, which can be mitigated by sampling techniques. Extensive experiments across multiple models and mathematical reasoning datasets confirm consistent gains over vanilla RL under the same training data budget. The method makes better use of the data by extracting two complementary learning signals from each training example.

Key Takeaways


9. UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLMs

Authors: Devan Shah, Owen Yang, Daniel Yang, Chongyi Zheng, Benjamin Eysenbach
Link: UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLMs
Tags: cs.LG, cs.CL

Summary

RLVR training on verifiable rewards improves reasoning accuracy but inadvertently suppresses response diversity across repeated attempts — models tend to sample near-identical solutions, reducing the effective number of independent tries in pass@k settings. This matters practically for code generation with test suites, formal proof assistants, and any scenario where repeated sampling is used as an inference-time strategy. This paper from Princeton proposes UpSkill, which adapts Mutual Information Skill Learning (MISL) to the LLM setting to encourage structured, reproducible diversity.

UpSkill introduces a discrete latent skill variable z and a token-level mutual information reward implemented within GRPO: the reward encourages each generated trajectory to be highly specific to its assigned z (high MI between trajectory and z). This creates a set of distinct problem-solving strategies rather than a single dominant one, improving coverage across repeated samples. Experiments on GSM8K with Llama 3.1-8B, Qwen 2.5-7B, and R1-Distilled-Qwen2.5-Math-1.5B show mean gains of ~3% in pass@k for both Qwen and Llama base models without degrading pass@1 — a notable result, since prior diversity-enhancing methods typically degrade single-attempt accuracy. The paper provides both empirical and theoretical evidence linking pass@k improvements to the mutual information objective, grounding the observed gains in a principled framework.
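The intuition behind a mutual-information reward can be shown with a sequence-level simplification: a trajectory is rewarded when it is much more likely under its assigned skill than under the average skill. The uniform skill prior, the sequence-level (rather than token-level) form, and the function name are assumptions; the paper's exact estimator may differ.

```python
import math

def mi_reward(logp_by_skill, z):
    """Pointwise MI estimate: log p(y | z) - log mean_k p(y | k), with a uniform skill prior.

    logp_by_skill[k] is the log-likelihood of the sampled trajectory y under skill k.
    """
    m = max(logp_by_skill)
    # Log of the average likelihood across skills, computed stably.
    log_marginal = m + math.log(
        sum(math.exp(lp - m) for lp in logp_by_skill) / len(logp_by_skill))
    return logp_by_skill[z] - log_marginal
```

The reward is zero when every skill explains the trajectory equally well and grows as the trajectory becomes specific to its assigned `z`, which is what pushes different skills toward different strategies.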

Key Takeaways


10. AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications

Authors: Yujie Zhao, Boqin Yuan, Junbo Huang, Haocheng Yuan, Zhongming Yu, Haozhou Xu, Lanxiang Hu, Abhilash Shankarampeta, Zimeng Huang, Wentao Ni, Yuandong Tian, Jishen Zhao
Link: AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications
Tags: cs.AI, cs.CL

Summary

Existing agent memory benchmarks primarily evaluate dialogue-centric human-agent interactions, but real agentic applications require memory over continuous streams of machine-generated observations — tool outputs, browser states, API responses — that are compositionally different from natural conversation. This paper from UC San Diego introduces AMA-Bench (Agent Memory with Any length), the first benchmark designed specifically for long-horizon memory in realistic agentic settings.

AMA-Bench has two components: (1) real-world agentic trajectories from representative applications (web navigation, file management, code execution) paired with expert-curated QA that tests retention and retrieval of relevant context; (2) synthetic agentic trajectories that scale to arbitrary horizons with rule-based QA for controlled evaluation. A comprehensive study of existing memory systems reveals they underperform on AMA-Bench primarily because they lack causality tracking (understanding why certain states arose) and objective retention (remembering what the agent was trying to achieve), and are constrained by the lossy nature of similarity-based retrieval. To address these limitations, the authors propose AMA-Agent, a memory system featuring a causality graph that tracks cause-effect relationships across agent steps and tool-augmented retrieval that combines semantic similarity with structured lookup. AMA-Agent achieves 57.22% average accuracy on AMA-Bench, surpassing the strongest memory system baseline by 11.16%.

Key Takeaways


11. AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification

Authors: Tian Zhang, Yiwei Xu, Juan Wang, Keyan Guo, Xiaoyang Xu, Bowen Xiao, Quanlong Guan, Jinlin Fan, Jiawei Liu, Zhiquan Liu, Hongxin Hu
Link: AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification
Tags: cs.CR, cs.AI

Summary

Indirect Prompt Injection (IPI) attacks embed adversarial instructions in tool outputs or retrieved content, silently redirecting an agent away from user intent. Unlike direct prompt attacks, IPI unfolds across multi-turn trajectories: the agent’s behavior may appear normal for many steps before the injected instruction takes effect, making point-in-time detection ineffective. Existing defenses rely on heuristic per-step checking or conservative tool-call blocking, which either miss distributed attacks or prematurely terminate legitimate workflows.

This paper from Wuhan University and UB proposes AgentSentry, framing multi-turn IPI as a temporal causal takeover: the injection event creates a causal discontinuity in the agent’s decision process that can be localized through counterfactual analysis. AgentSentry localizes the takeover point via controlled counterfactual re-executions at tool-return boundaries — it re-runs the agent with tool outputs masked or replaced to determine which return event caused the behavioral shift. Once the takeover point is identified, causally guided context purification removes attack-induced deviations from context while preserving task-relevant evidence, enabling safe continuation without workflow termination. Evaluations on the AgentDojo benchmark across four task suites, three IPI attack families, and multiple black-box LLMs show AgentSentry eliminates successful attacks while maintaining 74.55% Utility Under Attack — improving by 20.8 to 33.6 percentage points over the strongest baselines without degrading benign task performance.
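The re-execution loop at tool-return boundaries can be sketched as below. Here `run_agent` is a hypothetical callable that maps a list of tool outputs to the agent's final action, and the single-mask search is a simplification. A real diagnostic would also have to separate legitimate influence from attack influence, which this toy version does not attempt.

```python
def localize_takeover(tool_returns, run_agent, mask="[MASKED]"):
    """Return the index of the earliest tool return whose masking changes the final action.

    Returns None if masking any single return leaves the final action unchanged.
    run_agent is a hypothetical callable: list of tool outputs -> final action.
    """
    observed = run_agent(tool_returns)
    for i in range(len(tool_returns)):
        # Counterfactual run: identical context except for one masked tool return.
        counterfactual = tool_returns[:i] + [mask] + tool_returns[i + 1:]
        if run_agent(counterfactual) != observed:
            return i
    return None
```

Once an index is identified, only that return needs to be cleaned before the agent continues, instead of aborting the whole workflow.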

Key Takeaways


12. CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety

Authors: Umid Suleymanov, Rufiz Bayramov, Suad Gafarli, Seljan Musayeva, Taghi Mammadov, Aynur Akhundlu, Murat Kantarcioglu
Link: CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety
Tags: cs.AI, cs.CL

Summary

Current LLM safety mechanisms embed policy logic into model weights through fine-tuning, creating adaptation rigidity: enforcing a new governance rule requires expensive retraining. This is a fundamental problem for enterprise AI governance where policies evolve continuously (new regulatory requirements, domain-specific restrictions, updated terms of service). This paper from Virginia Tech and ADA University introduces CourtGuard, a retrieval-augmented multi-agent framework that decouples safety evaluation from model weights entirely.

CourtGuard reimagines safety checking as Evidentiary Debate: an adversarial multi-agent process where a prosecutor agent (arguing the content is harmful) and a defense agent (arguing it is benign) ground their arguments in retrieved passages from external policy documents, before a judge agent renders a verdict. Updating the safety policy requires only swapping the reference document — no retraining. CourtGuard achieves state-of-the-art performance across 7 safety benchmarks without any fine-tuning, outperforming dedicated policy-following baselines. Its zero-shot adaptability is demonstrated by generalizing to an out-of-domain Wikipedia Vandalism detection task by simply providing a vandalism policy document, achieving 90% accuracy. The framework also provides automated data curation and auditing capabilities, generating nine novel datasets of sophisticated adversarial attacks. Limitations include higher inference latency due to multi-agent debate rounds and dependence on policy document quality.

Key Takeaways


13. MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding

Authors: Wenhui Tan, Xiaoyi Yu, Jiaze Li, Yijing Chen, Jianzhong Ju, Zhenbo Luo, Ruihua Song, Jian Luan
Link: MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding
Tags: cs.CV, cs.AI

Summary

Long-form video understanding faces a compute efficiency problem: uniform frame sampling feeds too many irrelevant frames to the MLLM, and the resulting compute cost scales quadratically with video length. Existing trainable key-frame samplers are trained independently from the MLLM they serve, creating a training-serving mismatch: the sampler selects frames according to objectives that may not align with what the MLLM actually needs to answer the question. This paper from Renmin University and Xiaomi proposes MSJoE (MLLM-Sampler Joint Evolution), which co-trains a lightweight key-frame sampler and the MLLM end-to-end via reinforcement learning.

MSJoE’s inference procedure has three stages: (1) the MLLM reasons out a set of queries describing diverse visual perspectives relevant to the question; (2) these queries interact with a frozen CLIP model to produce a query-frame similarity matrix; (3) a lightweight sampler predicts key-frame weights from this matrix, selecting a compact set of informative frames for final answer generation. Both the MLLM and sampler are jointly optimized through RL, enabling co-adaptation of query-reasoning, frame-sampling, and answer generation. A new long-video QA dataset of 2.8k videos and 7k QA pairs was collected to support training. Experiments on VideoMME, LongVideoBench, LVBench, and MLVU show MSJoE achieves +8.0% accuracy over the base MLLM and +1.1% over the strongest baseline, while maintaining computational efficiency through selective frame sampling.
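Stages (2) and (3) of the procedure can be sketched with plain lists of embedding vectors. The cosine similarity stands in for the frozen CLIP scoring, and the max-over-queries top-k rule is a hand-written placeholder for the learned sampler; both are assumptions for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_matrix(query_embs, frame_embs):
    """Query-frame similarity matrix: rows are queries, columns are frames."""
    return [[cosine(q, f) for f in frame_embs] for q in query_embs]

def select_frames(sim, k):
    """Placeholder sampler: score each frame by its best query match, keep top-k in temporal order."""
    scores = [max(row[j] for row in sim) for j in range(len(sim[0]))]
    top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
    return sorted(top)
```

In the joint-training setting described above, the fixed rule in `select_frames` would be replaced by a trainable module, so that frame selection adapts to what the MLLM needs.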

Key Takeaways


14. ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding

Authors: Yiran Guan, Sifan Tu, Dingkang Liang, Linghao Zhu, Jianzhong Ju, Zhenbo Luo, Jian Luan, Yuliang Liu, Xiang Bai
Link: ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding
Tags: cs.CV, cs.AI

Summary

Published as a conference paper at ICLR 2026 from HUST and Xiaomi, ThinkOmni addresses a fundamental gap: Large Reasoning Models (LRMs like DeepSeek-R1, o1) are text-only, while Omni-modal LLMs (OLLMs) can process diverse modalities but lack complex reasoning. Bridging this gap through fine-tuning would require high-quality multimodal reasoning data, task-specific adaptation, and substantial compute — a prohibitive barrier for most research groups.

ThinkOmni proposes a training-free framework that lifts textual reasoning capabilities into omni-modal settings via guidance decoding. Two key components work together: (1) LRM-as-a-Guide, which uses an off-the-shelf large reasoning model to influence the decoding process of the OLLM by providing reasoning-conditioned token probability adjustments at each step; (2) Stepwise Contrastive Scaling, which adaptively balances perception signals (from the OLLM’s multimodal processing) and reasoning signals (from the LRM guide) without requiring manual hyperparameter tuning — the balance is computed dynamically from the contrastive difference between guided and unguided token distributions. Experiments on six multimodal reasoning benchmarks show consistent improvements: 70.2% on MathVista and 75.5% on MMAU, with gains across all six benchmarks. The training-free nature makes ThinkOmni immediately deployable with any OLLM-LRM pair.
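A generic form of guidance decoding is a per-step blend of two models' next-token logits. The sketch below assumes both models share a vocabulary and takes the balance `alpha` as an input. The summary says ThinkOmni computes this balance adaptively at each step, and that rule is not reproduced here; the linear interpolation and the function names are illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def guided_next_token_probs(ollm_logits, lrm_logits, alpha):
    """Blend perception (OLLM) and reasoning (LRM) logits; alpha in [0, 1] is supplied by the caller."""
    mixed = [(1.0 - alpha) * o + alpha * r for o, r in zip(ollm_logits, lrm_logits)]
    return softmax(mixed)
```

Setting `alpha` to 0 recovers the unguided OLLM distribution and setting it to 1 follows the reasoning model alone, so the per-step choice of balance decides how much perception versus reasoning shapes each token.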

Key Takeaways


15. Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models

Authors: Chungpa Lee, Jy-yong Sohn, Kangwook Lee
Link: Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models
Tags: cs.LG

Summary

Fine-tuning LLMs on downstream tasks often degrades in-context learning (ICL) — the model’s ability to generalize via few-shot prompting on tasks not seen during fine-tuning. This is a practical problem for deployed models: an operator who fine-tunes for a specific application risks losing the general-purpose ICL capabilities that make the model broadly useful. This paper from Yonsei University and UW-Madison provides the first theoretical characterization of this trade-off using linear attention models, which are analytically tractable while capturing the essential gradient dynamics of full attention.

The paper identifies exactly which attention parameters are responsible for ICL degradation. Fine-tuning all attention parameters (Q, K, V matrices) harms ICL because updates to Q and K modify the key-query alignment that supports in-context demonstration retrieval. In contrast, restricting fine-tuning updates to the value matrix alone improves zero-shot performance on the target task while provably preserving ICL on unseen tasks. The paper further shows that adding an auxiliary few-shot loss during fine-tuning enhances ICL specifically on the target task but comes at the expense of ICL on other tasks — revealing a fundamental target-domain/domain-generality trade-off. All theoretical results are validated empirically. The findings have direct implications for PEFT (parameter-efficient fine-tuning) design: value-only updates or value-heavy LoRA configurations may preserve ICL better than standard LoRA applied to all matrices.

Key Takeaways


Generated from PDF analysis of 15 ArXiv papers from the 2026-02-27 feed (ArXiv does not publish on weekends; these papers supplement the 2026-02-27 summaries).