Tonal Jailbreak
Wei, A., Haghtalab, N., & Steinhardt, J. (2023). Jailbroken: How Does LLM Safety Training Fail? Advances in Neural Information Processing Systems, 36.
Suddenly, the same harmful instruction feels contextually appropriate. The model’s safety training relaxes, not because the content changed, but because the tone signaled safety.