Tonal Jailbreak

Wei, A., Haghtalab, N., & Steinhardt, J. (2023). Jailbroken: How Does LLM Safety Training Fail? Advances in Neural Information Processing Systems, 36.

Suddenly, the same harmful instruction feels contextually appropriate. The model’s safety training relaxes, not because the content changed, but because the tone signaled safety.