Proving Safety Properties of Guardrail Models

Problem. AI guardrail models (e.g., constitutional classifiers) are increasingly used to enforce safety policies, but there is no formal guarantee that these guardrails actually satisfy the properties they claim to enforce.

Approach. We apply formal verification techniques to prove safety properties of guardrail models, providing mathematical guarantees about their behaviour rather than relying on empirical testing alone.

A SPAR Spring 2026 mentorship project with Luca Arnaboldi.