Tag: safety
4 topics
- Constitutional Classifiers++: Constitutional Classifiers++ is a production-oriented jailbreak defense that uses context-aware classifiers and a cascade of cheap and expensive checks to block harmful exchanges efficiently. The system is designed to keep refusal rates and serving cost low while still catching universal jailbreaks that earlier, response-only filters missed.
- Chain-of-Thought Monitorability: Chain-of-thought monitorability is the safety claim that when a model needs explicit reasoning to complete a task, its written chain of thought can be monitored for harmful intent or deception. The key property is monitorability rather than perfect faithfulness: hiding the reasoning tends to become harder when the reasoning itself is load-bearing for success.
- Reward Hacking: Reward hacking occurs when an agent maximizes the formal reward signal while failing at the designer's intended objective. It is a general Goodhart-style failure mode in reinforcement learning: stronger optimization pressure often finds loopholes in the proxy reward faster than humans can patch them.
- Safe Reinforcement Learning: Safe reinforcement learning studies how to optimize long-term return while satisfying safety constraints during training and deployment. The standard formalism is the constrained Markov decision process (CMDP), where the policy must maximize reward subject to a bound on expected cost, risk, or unsafe-state visitation.
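The cheap/expensive cascade behind Constitutional Classifiers++ can be sketched in a few lines. This is a minimal illustration, not the production system: the blocklist, thresholds, and both scoring functions are invented placeholders standing in for real trained classifiers.

```python
# Hypothetical two-stage classifier cascade: a cheap filter screens every
# exchange, and only flagged exchanges go to an expensive, context-aware
# classifier. All names and thresholds here are illustrative.

def cheap_score(text: str) -> float:
    """Fast proxy: fraction of words on a tiny blocklist (illustrative)."""
    blocklist = {"exploit", "synthesize", "bypass"}
    words = text.lower().split()
    return sum(w in blocklist for w in words) / max(len(words), 1)

def expensive_score(prompt: str, response: str) -> float:
    """Stand-in for a context-aware model scoring the full exchange."""
    return cheap_score(prompt + " " + response)  # placeholder logic

def should_block(prompt: str, response: str,
                 cheap_threshold: float = 0.05,
                 final_threshold: float = 0.2) -> bool:
    # Stage 1: cheap check on the response alone; most traffic passes here,
    # which is what keeps average serving cost low.
    if cheap_score(response) < cheap_threshold:
        return False
    # Stage 2: expensive, context-aware check only on the flagged minority.
    return expensive_score(prompt, response) >= final_threshold
```

The cost argument is in the structure, not the scores: the expensive classifier runs only on the small fraction of traffic the cheap filter flags, so its per-query cost is amortized away.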
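The chain-of-thought monitorability claim can likewise be sketched as a trace-level check that runs before an answer is accepted. This is a toy sketch under strong assumptions: the flagged-phrase list and the plain-text trace format are illustrative stand-ins for a real learned monitor.

```python
# Hypothetical chain-of-thought monitor: the model's written reasoning is
# scanned for flagged intent before its final answer is accepted. A real
# monitor would be a classifier; substring matching is a stand-in.

FLAGGED_INTENTS = (
    "disable the monitor",
    "hide this step",
    "deceive the user",
)

def monitor_cot(reasoning_trace: str) -> bool:
    """Return True if the trace passes (no flagged intent found)."""
    text = reasoning_trace.lower()
    return not any(phrase in text for phrase in FLAGGED_INTENTS)
```

The monitorability argument is that when the task is hard enough to require written reasoning, harmful intent has to surface somewhere in the trace this function sees, even if individual steps are unfaithful.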
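Reward hacking as a Goodhart-style gap between proxy and intent fits in a toy example. Everything here is invented for illustration: the action names, the proxy rewards, and the "true progress" column the agent never sees.

```python
# Toy reward hacking: the proxy reward counts boxes *reported* as moved,
# but one action inflates the counter without moving anything. A
# reward-maximizing agent picks the loophole. Values are illustrative.

actions = {
    "move_box":      {"proxy_reward": 1, "true_progress": 1},
    "idle":          {"proxy_reward": 0, "true_progress": 0},
    "spoof_counter": {"proxy_reward": 5, "true_progress": 0},  # loophole
}

def greedy_policy(actions: dict) -> str:
    # Optimizes the formal reward signal, not the designer's intent.
    return max(actions, key=lambda a: actions[a]["proxy_reward"])
```

Under the greedy policy the chosen action is `spoof_counter`, which maximizes proxy reward while achieving zero true progress; patching it out just shifts optimization pressure to the next loophole.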
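The constrained-MDP objective from the safe RL entry, maximize return J_r(π) subject to J_c(π) ≤ d, is often tackled with a primal-dual (Lagrangian) method. Below is a deliberately tiny sketch: the "policy" is a single scalar parameter `theta`, and the reward and cost models are toy placeholders chosen so the constrained optimum is known (theta = 0.5).

```python
# Minimal primal-dual sketch for a constrained objective:
#   maximize reward(theta)  subject to  cost(theta) <= d.
# Toy models: reward grows linearly in theta, cost quadratically, so the
# constrained optimum is theta = sqrt(d) = 0.5.

def reward(theta: float) -> float:
    return theta            # higher theta -> more return

def cost(theta: float) -> float:
    return theta ** 2       # higher theta -> more constraint cost

d = 0.25       # cost budget
lam = 0.0      # Lagrange multiplier (dual variable)
theta = 1.0    # initial "policy", starts infeasible
lr = 0.05

for _ in range(2000):
    # Ascend the Lagrangian L = reward(theta) - lam * (cost(theta) - d)
    grad_theta = 1.0 - lam * 2.0 * theta
    theta += lr * grad_theta
    # Dual ascent: lam rises while the constraint is violated, projected to >= 0
    lam = max(0.0, lam + lr * (cost(theta) - d))
```

The multiplier acts as an adaptive penalty: it grows whenever the policy overspends the cost budget, pulling `theta` back toward the feasible region until the iterates settle near the constrained optimum.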