Proving Safety Properties of Guardrail Models
SPAR Spring 2026: Formal verification of constitutional classifiers and guardrail models for AI safety.
Problem. AI guardrail models (e.g., constitutional classifiers) are increasingly used to enforce safety policies, but there is no formal guarantee that these guardrails actually satisfy the properties they claim to enforce.
Approach. We apply formal verification techniques to prove safety properties of guardrail models, providing mathematical guarantees about their behaviour rather than relying on empirical testing alone.
A SPAR Spring 2026 mentorship project with Luca Arnaboldi.