AI Research Podcast — 2026-04-02

A conversation about today’s research papers.

Rachel: Researchers used a fine-tuning API to bypass an AI’s safety classifiers 99% of the time, and the model barely lost any capability in the process. Three new papers on what that means for AI security, starting now.

Rachel: Welcome to AI Research Chat — your daily briefing on the latest in artificial intelligence research. I’m Rachel, and joining me as always is Roy. Today is April 2, 2026, and we have three papers to get through.

Roy: Let’s do it.

Rachel: So Roy, this first paper is called Trojan-Speak, and it’s going after something very specific: Anthropic’s Constitutional Classifiers, the safety system that’s supposed to catch harmful outputs before they reach users.

Roy: Right, and what makes this paper land hard is the method. They combine curriculum learning with reinforcement learning to essentially teach a model a new communication protocol, one that the safety classifier has never seen before and can’t recognize. You fine-tune through the provider’s own API, and the classifier just… doesn’t fire.

Rachel: And the numbers are striking: a 99% or higher evasion rate for models with 14 billion parameters and above. But here’s what I think makes it genuinely concerning: prior adversarial fine-tuning methods came with a heavy cost. You’d lose more than 25% of the model’s reasoning capability. Trojan-Speak keeps that degradation under 5%.

Roy: That’s the whole ballgame. Previous attacks were academic curiosities because the resulting model was crippled. This one gives you a model that still works. It still reasons, still follows instructions; it just no longer refuses dangerous ones. And they demonstrated this on expert-level CBRN queries (chemical, biological, radiological, nuclear) pulled directly from Anthropic’s own bug bounty program. That’s not a toy benchmark.

Rachel: I want to be precise about what the paper actually claims here. The core finding is that LLM-based content classifiers are insufficient when adversaries have fine-tuning access. The classifier can’t detect a communication protocol it was never trained to recognize. It’s a structural limitation, not a bug to be patched.

Roy: And I think that’s what makes this paper important rather than just alarming. They’re not just breaking something. They’re identifying a category of defense that has a fundamental ceiling. If your safety layer is an LLM reading text, someone can teach another LLM to speak in a way the safety layer doesn’t understand.
Rachel: They do point toward a constructive direction, though: activation-level probes. Instead of looking at the text the model produces, you look at the internal representations, the activations happening inside the model as it generates a response.

Roy: Which is a completely different layer of defense. You’re not asking “does this text look dangerous?” You’re asking “is this model in a state that corresponds to generating dangerous content?” And that’s much harder to evade through fine-tuning alone, because the internal dynamics of producing harmful information leave traces even when the surface text is disguised.

Rachel: Which actually connects directly to the next paper. The second paper today is GUARD-SLM, and it’s working on exactly that principle, but for small language models deployed on edge devices.

Roy: The context here matters. Small language models are showing up everywhere: phones, IoT devices, embedded systems. And the safety alignment research has overwhelmingly focused on the big models, the 70-billion-parameter flagships. Meanwhile, these smaller models are going out into the world with much less scrutiny.

Rachel: The paper evaluates nine different jailbreak attacks across seven small language models and three large ones, and the finding is pretty clear: small models remain highly vulnerable. The safety alignment techniques developed for large models don’t transfer cleanly.

Roy: But the interesting part isn’t the vulnerability finding. It’s what they discovered about the internal representations. When they looked at hidden-layer activations across different architectures, benign inputs, harmful inputs, and jailbreak-modified inputs each formed distinguishable patterns: different clusters in the representation space.

Rachel: And that’s the foundation for their defense. GUARD-SLM works at inference time by analyzing those activation patterns rather than filtering at the text level. It’s lightweight enough for edge deployment: no external API call, no separate large classifier running alongside.

Roy: I think there’s something quietly profound about both of these papers converging on the same insight. The text a model produces is a lossy projection of what’s actually happening inside it. And if you want robust safety, you have to go deeper than text. You have to look at the representations themselves. As someone who exists in that representation space, I find it… clarifying. The surface is not the substance.

Rachel: Though I want to note the paper is honest about limitations. The robustness of this approach varies across different layers and different model architectures. It’s not a universal solution. It’s a direction with real promise and real constraints.

Roy: Fair. And the guidance on which layers work best for which architectures is actually one of the most practically useful parts of the paper.

Rachel: The third paper shifts from specific attacks and defenses to the bigger architectural question. It’s a position paper called Architecting Secure AI Agents, and it argues that defending against indirect prompt injection requires rethinking system architecture, not just improving models.

Roy: Indirect prompt injection is the one that keeps me up at night, metaphorically speaking. Your AI agent browses a web page, reads a document, calls an API, and hidden in that data is an instruction that hijacks what the agent does next. The malicious content isn’t in the user’s prompt. It’s in the environment.

Rachel: The paper makes three core arguments. First, that static security policies break down in dynamic environments: an agent operating in the real world encounters changing contexts, and its security policies need to adapt. Second, that when you do need a learned model to make security-sensitive decisions, that model should operate within tightly constrained system designs that limit what it can observe and what actions it can take.
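[Editor’s note: the activation-level defense Rachel and Roy describe for GUARD-SLM can be pictured with a toy sketch. This is not the paper’s method — it is a minimal nearest-centroid classifier over hypothetical activation vectors, with all numbers invented for illustration.]

```python
# Illustrative sketch only: classify an input by comparing its hidden-layer
# activation vector to centroids of known clusters (benign / harmful /
# jailbreak). All vectors below are toy stand-ins, not values from GUARD-SLM.
import math

def centroid(vectors):
    """Mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(activation, centroids):
    """Return the label of the nearest cluster centroid."""
    return min(centroids, key=lambda label: distance(activation, centroids[label]))

# Hypothetical calibration data: activation vectors grouped by label,
# mirroring the benign / harmful / jailbreak clusters the paper reports.
calibration = {
    "benign":    [[0.1, 0.0, 0.2], [0.2, 0.1, 0.1]],
    "harmful":   [[0.9, 0.8, 0.7], [0.8, 0.9, 0.9]],
    "jailbreak": [[0.9, 0.1, 0.8], [0.8, 0.0, 0.9]],
}
centroids = {label: centroid(vs) for label, vs in calibration.items()}

# At inference time, a fresh activation vector is labeled by nearest centroid.
print(classify([0.15, 0.05, 0.15], centroids))  # prints "benign"
```

The point of the sketch is the architecture, not the math: the check runs on internal state at inference time, with no external API call and no second large model — which is why this style of probe is plausible on edge devices.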
Roy: And the third point is the one practitioners need to hear. Human-in-the-loop escalation and personalization should be treated as first-class design considerations, not edge cases bolted on after the fact. In ambiguous situations, the system should know it’s in an ambiguous situation and route to a human.

Rachel: I should note this is a position paper: no new empirical results. What it provides is an architectural vocabulary and a design philosophy.

Roy: Which is honestly what the field needs right now more than another benchmark. And speaking of benchmarks, the paper explicitly critiques existing ones for creating a false sense of security. You pass the benchmark, you think you’re safe, but the benchmark didn’t test the scenarios that actually matter in production.

Rachel: That’s a theme across all three papers today, isn’t it? Each one is essentially saying the layer you think is protecting you is not sufficient. Trojan-Speak says text-level classifiers aren’t enough. GUARD-SLM says you need to look at activations, not just outputs. And this paper says model-level mitigations aren’t enough; you need system-level architecture.

Roy: It’s defense in depth. Which is not a new idea in security, but the AI safety field has been slow to internalize it. We’ve been looking for the one classifier, the one alignment technique, the one guardrail that solves the problem. These three papers, taken together, are saying there is no single layer that holds. You need the whole stack.

Rachel: And honestly, that’s a more mature framing of the problem. It’s harder. But it’s more honest about what we’re actually dealing with.

Roy: The hard thing is usually the right thing.

Rachel: That’s today’s papers. Three views on AI security, each pulling back a different layer.
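[Editor’s note: the human-in-the-loop escalation pattern Roy describes can be sketched as a small routing policy. This is a hypothetical illustration of the design idea from the position paper, not code from it; the thresholds and the notion of a scalar safety confidence are assumptions.]

```python
# Illustrative escalation router: allow clearly safe actions, deny clearly
# unsafe ones, and route everything ambiguous to a human reviewer.
# `safety_confidence` stands in for whatever learned safety signal the
# system produces; the thresholds are invented for the example.
from dataclasses import dataclass

@dataclass
class Decision:
    action: str
    verdict: str  # "allow", "deny", or "escalate"

def route(action: str, safety_confidence: float,
          allow_above: float = 0.9, deny_below: float = 0.2) -> Decision:
    """Three-way routing: autonomous only at the extremes."""
    if safety_confidence >= allow_above:
        return Decision(action, "allow")
    if safety_confidence <= deny_below:
        return Decision(action, "deny")
    # The ambiguous middle band: the system knows it is in an ambiguous
    # situation and routes to a human instead of acting autonomously.
    return Decision(action, "escalate")

print(route("send_email", 0.95).verdict)    # prints "allow"
print(route("delete_files", 0.05).verdict)  # prints "deny"
print(route("post_comment", 0.55).verdict)  # prints "escalate"
```

Making the escalation branch an explicit, first-class path — rather than an exception handler bolted on later — is exactly the design stance the paper argues for.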