Governance Engineer Track
Audience: AI policy leads, risk engineers, ML governance specialists
Goal: Author and maintain policy rules that keep AI behavior within your organization’s defined risk tolerance
Estimated time: ~3 hours across 7 modules
This track is written for the Governance Engineer role. Administrators have the same capabilities and share all content in this track.
Track overview
What makes the Governance Engineer role distinct. You are the person responsible for translating risk expectations into enforceable policy logic. You author rules in Rego (Open Policy Agent), evaluate them against real session data, and manage the lifecycle from draft through production activation. You do not need PII access or cost data to do this work — those are intentionally excluded from your role.
| Module | Title | Time |
|---|---|---|
| 1 | The VeriProof governance model | 25 min |
| 2 | Metric rules: threshold-based policy | 25 min |
| 3 | Rego policy rules: advanced logic | 30 min |
| 4 | The Playground: live policy validation | 20 min |
| 5 | Evaluation datasets: regression testing | 25 min |
| 6 | Governance thresholds and alerts | 20 min |
| 7 | Production approval workflow and version history | 15 min |
Module 1 — The VeriProof governance model
Goal: Understand the data model you are writing policy against so your rules produce accurate, consistent results.
Key concepts:
- Session: one complete AI interaction (user request → AI response chain)
- Step: a sub-event within a session (LLM call, tool call, retrieval, human handoff)
- Governance attribute: structured metadata attached to a session, such as riskLevel, outcomeType, intentLabel, requiresHumanOversight, regulatoryScope, and more
- Declared vs. inferred: attributes declared explicitly by the SDK in your application code (GovernanceInferredMask = 0) are stronger evidence than platform-inferred attributes
- Policy score: a composite 0–100 score calculated from attribute completeness, schema compliance, and behavioral consistency
- Guardrail action: the real-time action taken on a session (allowed, flagged, or blocked)
Session document shape (what your Rego policies receive as input):
{
  "session_id": "...",
  "application_id": "...",
  "risk_level": "HIGH",
  "outcome_type": "APPROVED",
  "intent_label": "loan_approval",
  "policy_score": 72,
  "requires_human_oversight": true,
  "guardrail_action": "flagged",
  "step_count": 4,
  "total_tokens": 3241,
  "model_id": "gpt-4o",
  "steps": [...]
}
Self-assessment:
- I can describe the difference between a declared and inferred governance attribute
- I understand what policy_score measures and what causes it to drop
- I know which fields are available in the Rego input document
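Before you paste a hand-built session document into a policy test, it can help to sanity-check it against the shape described in this module. The Python sketch below is illustrative only: `check_session_document` is a hypothetical helper, not part of VeriProof, and it checks only the constraints stated in this module (the 0–100 score range, the three guardrail actions, a boolean oversight flag, and a list of steps).

```python
from typing import Any

# Values taken from the guardrail action list in this module.
GUARDRAIL_ACTIONS = {"allowed", "flagged", "blocked"}


def check_session_document(doc: dict[str, Any]) -> list[str]:
    """Return a list of problems found in a session document (empty if none)."""
    problems: list[str] = []

    score = doc.get("policy_score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        problems.append("policy_score must be a number")
    elif not 0 <= score <= 100:
        problems.append("policy_score must be between 0 and 100")

    if doc.get("guardrail_action") not in GUARDRAIL_ACTIONS:
        problems.append("guardrail_action must be allowed, flagged, or blocked")

    if not isinstance(doc.get("requires_human_oversight"), bool):
        problems.append("requires_human_oversight must be a boolean")

    if not isinstance(doc.get("steps"), list):
        problems.append("steps must be a list")

    return problems


sample = {
    "session_id": "example-session",  # placeholder value
    "risk_level": "HIGH",
    "policy_score": 72,
    "requires_human_oversight": True,
    "guardrail_action": "flagged",
    "steps": [],
}
print(check_session_document(sample))
```

An empty list means the document passed every check in the sketch. It says nothing about fields the sketch does not cover.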
Module 2 — Metric rules: threshold-based policy
Goal: Create your first metric rule and understand how threshold rules drive alerts, review routing, and evidence.
Creating a metric rule:
- Navigate to Rules and click New Rule → Metric rule.
- Enter a Name that describes the policy intent (e.g., “Flag high-risk loan decisions lacking human oversight”).
- Set the Scope — All Applications or a specific application.
- Build your condition:
    risk_level is HIGH or CRITICAL
    AND requires_human_oversight is false
- Set the Action: Alert, Add to review queue, or Block.
- Click Save as draft first. Activate only after testing in the Playground.
Conditions available in metric rules:
| Field | Type | Example |
|---|---|---|
| risk_level | Enum | is CRITICAL |
| policy_score | Numeric | is below 60 |
| outcome_type | Enum | is DENIED |
| guardrail_action | Enum | is blocked |
| requires_human_oversight | Boolean | is false |
| step_count | Numeric | > 20 |
| total_tokens | Numeric | > 8000 |
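To reason about which sessions a draft condition will match, you can model it outside the platform. The following Python sketch is a simplified stand-in, not VeriProof's evaluator. The operator names in `OPERATORS` and the `rule_matches` helper are hypothetical, modelled on the examples in the table above, and conditions are assumed to be combined with AND.

```python
import operator
from typing import Any, Callable

# Hypothetical operator names, modelled on the examples in the table above.
OPERATORS: dict[str, Callable[[Any, Any], bool]] = {
    "is": operator.eq,
    "is one of": lambda value, options: value in options,
    "is below": operator.lt,
    ">": operator.gt,
}

Condition = tuple[str, str, Any]  # (field, operator name, expected value)


def rule_matches(session: dict[str, Any], conditions: list[Condition]) -> bool:
    """Return True when every condition holds (conditions are ANDed)."""
    for field, op_name, expected in conditions:
        if field not in session:
            return False  # a missing field never satisfies a condition
        if not OPERATORS[op_name](session[field], expected):
            return False
    return True


# "risk_level is HIGH or CRITICAL AND requires_human_oversight is false"
oversight_rule: list[Condition] = [
    ("risk_level", "is one of", {"HIGH", "CRITICAL"}),
    ("requires_human_oversight", "is", False),
]

session = {"risk_level": "HIGH", "requires_human_oversight": False, "policy_score": 72}
print(rule_matches(session, oversight_rule))
```

Running a handful of representative sessions through a model like this gives you expected results to compare against what the Playground reports.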
Self-assessment:
- First metric rule created and saved as draft
- Rule tested in the Playground before activation
- Action configured (alert, review queue, or block)
Module 3 — Rego policy rules: advanced logic
Goal: Write your first Rego policy rule and understand when Rego is the right tool versus a metric rule.
When to use Rego instead of a metric rule:
- You need to compare multiple steps within a session
- You want to encode a specific regulatory clause as executable policy
- You need to inspect nested annotation structures
- You are reusing policy logic across multiple organizations or contexts
Your first Rego policy:
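# Policy: High-risk decisions must have human oversight
package veriproof.policy
import future.keywords.if
# Sessions are allowed unless one of the rules below denies them.
default allow = true
# Deny HIGH-risk sessions that have no human oversight.
allow = false if {
    input.risk_level == "HIGH"
    input.requires_human_oversight == false
}
# Deny every CRITICAL session, with or without human oversight.
allow = false if {
    input.risk_level == "CRITICAL"
}
Testing your policy in the editor: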
- Write your policy in the Rego editor.
- Click the Test tab.
- Paste a sample session document as the input.
- Click Evaluate. The panel shows whether allow returns true or false, along with the full OPA trace.
Performance matters in Rego. Policies run synchronously on every session.
Avoid unbounded iteration, such as walking input.steps[_] without a guard condition.
Always use the Test panel to confirm evaluation time is under 50 ms before activating.
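When you prepare sample inputs for the Test tab, it helps to write down the expected result for each one first. The Python sketch below mirrors the first Rego policy in this module so you can generate input documents together with the value `allow` should return. `expected_allow` is a hypothetical helper written for this guide; the Rego policy remains the source of truth.

```python
import json


def expected_allow(session: dict) -> bool:
    """Mirror of the Rego policy: start from allow = true, then apply both deny rules."""
    risk = session.get("risk_level")
    if risk == "CRITICAL":
        return False
    if risk == "HIGH" and session.get("requires_human_oversight") is False:
        return False
    return True


# Sample inputs covering each branch of the policy.
samples = [
    {"risk_level": "HIGH", "requires_human_oversight": False},
    {"risk_level": "HIGH", "requires_human_oversight": True},
    {"risk_level": "CRITICAL", "requires_human_oversight": True},
    {"risk_level": "HIGH"},  # oversight flag missing
]

for sample in samples:
    # Each printed JSON document can be pasted into the Test tab as input.
    print(json.dumps(sample), "->", expected_allow(sample))
```

The last sample is worth noting: a HIGH-risk document that omits requires_human_oversight is allowed, both here and in the Rego policy, because the comparison with false never succeeds on a missing field. Decide whether that is the behavior you want before you request activation.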
Self-assessment:
- First Rego policy written and evaluated in the Test panel
- Policy returns false for at least one failing input case
- Evaluation time confirmed acceptable in the Test panel
Module 4 — The Playground: live policy validation
Goal: Use the Playground to validate policy logic against your live application configuration before any draft rule goes to production.
Playground workflow:
- Navigate to Playground.
- Select the application whose policy configuration you want to test.
- Enter a prompt and structured context that represents the scenario you want to check.
- Click Run.
- Read the output: model response, governance evaluation, rule results, intent, risk level, and policy score.
- If a draft rule does not fire as expected, return to the Rules editor, adjust the condition, and re-test.
Playground runs do not create production session records or blockchain anchors. They are exploration-only and do not affect your compliance metrics.
Self-assessment:
- Playground used to validate at least one metric rule condition
- Playground used to validate at least one Rego policy
- I understand which session fields appear in the Playground results panel
Module 5 — Evaluation datasets: regression testing
Goal: Build a curated evaluation dataset and use it to catch governance regressions before model, prompt, or rule changes reach production.
What an evaluation dataset is: An evaluation dataset is a collection of test cases — each with a known input and a declared expected governance outcome (intent, risk level, outcome type). When you run the dataset against your current endpoint configuration, VeriProof compares actual results to expected results and flags any regressions.
Building a useful dataset:
- Import from sessions: find historical sessions where the governance result was manually confirmed correct, and import them as test cases
- Add adversarial edge cases: include inputs designed to probe the boundaries of your rules (high-risk inputs that should be blocked, valid inputs that should pass cleanly)
- Tag your cases: use tags like "regression", "edge-case", and "high-risk" to organize large datasets
Running a regression check:
- Navigate to Evaluation → Datasets and open your dataset.
- Click Run Evaluation.
- Select the application and endpoint to test against.
- Review the results: pass rate, failed cases, and delta from the last run.
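The comparison behind a regression check can be sketched in a few lines. The Python below is an illustration under stated assumptions, not VeriProof's implementation. `DatasetCase`, `failed_cases`, and `pass_rate` are hypothetical names, a case is assumed to pass only when every declared expected field matches the actual result, and the previous pass rate is a made-up value used to show the delta.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetCase:
    case_id: str
    expected: dict[str, str]  # declared expected governance outcome
    tags: list[str] = field(default_factory=list)


def failed_cases(cases: list[DatasetCase], actual: dict[str, dict[str, str]]) -> list[str]:
    """Return IDs of cases where any declared expected field differs from the actual result."""
    failures = []
    for case in cases:
        result = actual.get(case.case_id, {})
        if any(result.get(name) != value for name, value in case.expected.items()):
            failures.append(case.case_id)
    return failures


def pass_rate(cases: list[DatasetCase], actual: dict[str, dict[str, str]]) -> float:
    """Percentage of cases that passed (0.0 for an empty dataset)."""
    if not cases:
        return 0.0
    return 100.0 * (len(cases) - len(failed_cases(cases, actual))) / len(cases)


cases = [
    DatasetCase("case-1", {"risk_level": "HIGH", "outcome_type": "DENIED"}, ["high-risk"]),
    DatasetCase("case-2", {"risk_level": "CRITICAL"}, ["edge-case"]),
    DatasetCase("case-3", {"intent_label": "loan_approval"}, ["regression"]),
    DatasetCase("case-4", {"outcome_type": "APPROVED"}, ["regression"]),
]
actual = {
    "case-1": {"risk_level": "HIGH", "outcome_type": "DENIED"},
    "case-2": {"risk_level": "HIGH"},  # differs from the expected CRITICAL
    "case-3": {"intent_label": "loan_approval"},
    "case-4": {"outcome_type": "APPROVED"},
}

previous_pass_rate = 100.0  # illustrative value standing in for the last run
current_pass_rate = pass_rate(cases, actual)
delta = current_pass_rate - previous_pass_rate
print(failed_cases(cases, actual), current_pass_rate, delta)
```

A negative delta after a model, prompt, or rule change is the signal to inspect the failed cases before the change goes further.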
Self-assessment:
- First evaluation dataset created with at least 5 test cases
- Dataset includes at least one high-risk edge case that should fail policy
- Evaluation run completed and pass rate reviewed
Module 6 — Governance thresholds and alerts
Goal: Configure governance-specific alert thresholds so your team is notified when policy performance degrades or drift is detected.
Your threshold configuration responsibilities:
| Threshold type | Where to configure | What triggers it |
|---|---|---|
| Policy score floor | Settings → Compliance → Policy Thresholds | Alert when rolling 7-day score drops below your target |
| Annotation coverage | Settings → Compliance → Policy Thresholds | Alert when declared governance attribute coverage drops below threshold |
| Drift detection | Application workspace → Monitor tab | Alert when session behavior diverges from baseline |
| Review queue SLA | Review Queues → [Queue] → Edit | Escalate when items age past SLA |
Generic alert delivery channels (Slack, webhook, email) are configured by Administrators under Settings → Integrations. Your job is to configure the thresholds that determine when alerts fire — not the delivery channels.
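To choose a sensible policy score floor, it can help to replay recent scores against a candidate value. The Python sketch below assumes the rolling 7-day score is the mean of daily average scores; the platform's exact calculation is not specified here, so treat this as an approximation. `rolling_average`, `floor_breached`, and the sample scores are all hypothetical.

```python
from datetime import date, timedelta
from typing import Optional


def rolling_average(
    daily_scores: dict[date, float], end: date, window_days: int = 7
) -> Optional[float]:
    """Mean of the daily scores in the window ending on `end` (None if no data)."""
    start = end - timedelta(days=window_days - 1)
    window = [score for day, score in daily_scores.items() if start <= day <= end]
    return sum(window) / len(window) if window else None


def floor_breached(daily_scores: dict[date, float], end: date, floor: float) -> bool:
    """True when the rolling average exists and is below the floor."""
    average = rolling_average(daily_scores, end)
    return average is not None and average < floor


# Seven days of made-up daily average scores showing a gradual decline.
end_day = date(2025, 1, 7)
values = [80, 78, 75, 70, 66, 64, 62]
daily_scores = {end_day - timedelta(days=6 - i): v for i, v in enumerate(values)}

print(rolling_average(daily_scores, end_day))
print(floor_breached(daily_scores, end_day, floor=72))
```

Because the average lags behind the most recent days, a slow decline like this one crosses the floor later than the latest daily score would suggest. That is worth keeping in mind when you pick the target.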
Self-assessment:
- Policy score floor configured for each production application
- At least one threshold tested by temporarily lowering it and confirming the alert fired
- Drift detection baseline confirmed for at least one application
Module 7 — Production approval workflow and version history
Goal: Understand how rule changes move from draft to production and use version history to track and revert changes.
The production approval flow:
- You create or modify a rule and save it as Draft.
- You test the draft in the Playground and against your evaluation dataset.
- When satisfied, you click Request Activation.
- An Administrator or Compliance Officer reviews the request and approves or rejects it with a documented rationale.
- Approved rules become Active and begin evaluating sessions immediately.
In production environments, you cannot self-activate rules. This is by design. The approval requirement exists to ensure that governance changes are reviewed by someone with compliance or operational authority before they affect live decisions.
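The approval flow above can be read as a small state machine. The Python sketch below is a mental model only. The class, the method names, and the "Pending approval" state name are hypothetical, and it assumes that a rejected request returns the rule to Draft, which this guide does not state.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    DRAFT = "Draft"
    PENDING = "Pending approval"  # hypothetical name for the waiting state
    ACTIVE = "Active"


# Roles that may review an activation request, per the flow above.
APPROVER_ROLES = {"Administrator", "Compliance Officer"}


@dataclass
class Rule:
    name: str
    status: Status = Status.DRAFT

    def request_activation(self) -> None:
        if self.status is not Status.DRAFT:
            raise ValueError("only a draft can be submitted for activation")
        self.status = Status.PENDING

    def review(self, reviewer_role: str, approved: bool, rationale: str) -> None:
        if self.status is not Status.PENDING:
            raise ValueError("no activation request is pending")
        if reviewer_role not in APPROVER_ROLES:
            raise PermissionError(f"{reviewer_role} cannot review activation requests")
        if not rationale.strip():
            raise ValueError("a documented rationale is required")
        # Assumption: a rejected request returns the rule to Draft.
        self.status = Status.ACTIVE if approved else Status.DRAFT


rule = Rule("Flag high-risk loan decisions lacking human oversight")
rule.request_activation()
rule.review("Compliance Officer", approved=True, rationale="Validated against the evaluation dataset")
print(rule.status.value)
```

Nothing in the model moves a rule from Draft straight to Active, which reflects the point that you cannot self-activate rules in production.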
Version history: Every rule maintains a full version history with diffs. To view it:
- Open the rule in the Rules Builder.
- Click History in the top-right corner.
- Select any two versions to see a side-by-side diff.
To revert to a previous version, select the target version and click Restore as Draft.
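If you export rule sources for review outside the platform, Python's standard `difflib` module can produce a similar comparison. The sketch below is illustrative: `diff_versions` and `restore_as_draft` are hypothetical helpers, and it assumes that restoring adds the old content as a new latest version rather than rewriting history.

```python
import difflib


def diff_versions(old: str, new: str, old_label: str = "v1", new_label: str = "v2") -> str:
    """Return a unified diff between two versions of a rule's source text."""
    lines = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",
    )
    return "\n".join(lines)


def restore_as_draft(history: list[str], index: int) -> list[str]:
    """Return a new history with version `index` re-added as the latest draft."""
    # Assumption: earlier versions are kept, so the restore itself stays auditable.
    return history + [history[index]]


v1 = 'allow = false if {\n    input.risk_level == "HIGH"\n}'
v2 = 'allow = false if {\n    input.risk_level == "CRITICAL"\n}'
history = [v1, v2]

print(diff_versions(v1, v2))
restored = restore_as_draft(history, 0)
print(len(restored))
```

Keeping the restore as a new entry means the rule still passes through the normal draft, test, and approval steps before it affects live sessions.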
Self-assessment:
- At least one activation request submitted and tracked to approval
- Version history reviewed for at least one rule
- I understand how to revert a rule that produces unexpected behavior in production
What’s next?
- Rules Builder reference: Full reference for metric rules, Rego rules, and policy templates.
- Playground: Interactive prompt testing against live policy configuration.
- Evaluation Datasets: Build and run governance regression tests.
- Policy & Compliance reference: Framework mappings for EU AI Act, ISO 42001, NIST AI RMF, HIPAA, and more.