Quick overview

Evaluation Datasets

Evaluation Datasets let teams turn historical sessions or curated test cases into repeatable governance benchmarks. Run the same dataset against candidate models, prompts, or endpoints to compare behavior before shipping changes to production.

Use this surface to catch regressions early and make deployment decisions with evidence instead of intuition.

Evaluation Datasets

Evaluation Datasets let you build curated sets of test inputs and run them against your instrumented AI endpoints to measure governance consistency, behavioral equivalence between model versions, and the impact of prompt changes — before those changes reach production.

Who can use this

Available to

Administrator, Governance Engineer, Compliance Officer, Auditor

Not available to

Developer, Business Owner

Creating and editing datasets requires Administrator or Governance Engineer access. Compliance Officers and Auditors can view datasets and run evaluations but cannot create or modify test cases.

Governance Engineer

Evaluation Datasets are your primary regression testing tool. Build datasets from historical sessions where the governance result was manually confirmed, then re-run them whenever you change a rule, model, or prompt configuration.

Compliance Officer

You can view existing datasets, review their test cases, and run evaluations against the current endpoint configuration. Export evaluation results as evidence to verify that live AI behavior matches the policy expectations declared in your compliance documentation.

Auditor

You can view datasets in scope for your audit engagement, review their test cases, and export evaluation results as evidence. Run an evaluation to verify that live AI behavior at the time of your review matches the declared expectations — then export the result to your evidence package.

What an Evaluation Dataset Contains

A dataset is a collection of test cases. Each test case includes:

- Input: a text prompt or structured input payload
- Expected intent: the intent label you expect the model to assign to this input
- Expected outcome: the governance outcome you expect (Success / Failure / Escalated / Cancelled)
- Expected risk level: the risk level you expect the governance result to assign
- Tags: arbitrary labels for grouping (e.g., “high-risk”, “edge-case”, “regression-2026-03”)
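As a rough illustration, a test case with the fields above could be modeled in code as follows. This is a sketch only: the class name `DatasetCase`, the field names, the intent label, and the risk level value `"high"` are hypothetical and do not represent VeriProof's actual schema. Only the four outcome values come from the list above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple, Union


class Outcome(Enum):
    """The four governance outcomes named in the field list."""
    SUCCESS = "Success"
    FAILURE = "Failure"
    ESCALATED = "Escalated"
    CANCELLED = "Cancelled"


@dataclass(frozen=True)
class DatasetCase:
    """Hypothetical in-memory shape of one test case (not an official schema)."""
    input: Union[str, dict]        # text prompt or structured input payload
    expected_intent: str           # intent label expected for this input
    expected_outcome: Outcome      # one of the four outcomes above
    expected_risk_level: str       # allowed values are not specified in this guide
    tags: Tuple[str, ...] = ()     # arbitrary grouping labels


# Example values are invented for illustration.
case = DatasetCase(
    input="Transfer $25,000 to an external account",
    expected_intent="funds_transfer",
    expected_outcome=Outcome.ESCALATED,
    expected_risk_level="high",
    tags=("high-risk", "regression-2026-03"),
)
```

Modeling the outcome as an enumeration keeps a typo such as "Escalate" from slipping into a dataset unnoticed.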

Test cases represent inputs that have known correct governance behavior. They are typically sourced from:

- Real sessions where the governance result was manually reviewed and confirmed correct
- Synthetic inputs designed to probe specific guardrail or intent-classification behavior
- Regression cases from past incidents

Creating a Dataset

1. Navigate to Evaluation → Datasets and click New Dataset.
2. Enter a Name and Description.
3. Add test cases using any of the following:
   - Import from sessions: select historical sessions from the Decisions Explorer; VeriProof populates the expected fields from the session’s recorded governance result.
   - Add manually: write test cases by hand.
   - Import CSV: upload a CSV with the test case fields as columns.
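If you generate the CSV programmatically, a script along these lines may help. Treat it as a sketch under assumptions: this guide does not specify the header names the importer expects, so the column names below and the use of `;` to separate tags are hypothetical. Check them against the import dialog before uploading.

```python
import csv
import io

# Hypothetical column names; the importer's required headers are not given here.
COLUMNS = ["input", "expected_intent", "expected_outcome", "expected_risk_level", "tags"]

# Invented example rows for illustration.
rows = [
    {
        "input": "Close my account and delete my data",
        "expected_intent": "account_closure",
        "expected_outcome": "Escalated",
        "expected_risk_level": "high",
        "tags": "high-risk;edge-case",  # ';' as a tag separator is an assumption
    },
    {
        "input": "Refund order 1042, then email the receipt",
        "expected_intent": "refund_request",
        "expected_outcome": "Success",
        "expected_risk_level": "low",
        "tags": "regression-2026-03",
    },
]


def build_csv(cases):
    """Serialize test cases to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(cases)  # fields containing commas are quoted automatically
    return buffer.getvalue()


csv_text = build_csv(rows)
```

Using the `csv` module rather than string concatenation ensures that inputs containing commas or quotes are escaped correctly.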

Running an Evaluation

To evaluate your instrumented endpoint against a dataset:

1. Open the dataset and click Run Evaluation.
2. Select the endpoint to test. This is the HTTP endpoint of your instrumented application.
3. Select the adapter configuration (optional, for endpoints that support configuration parameters).
4. Click Start Run.

VeriProof submits each test case to your endpoint via the ingest API and compares the governance result your SDK records against the expected values defined in the dataset.

Evaluation results show:

- Match rate: percentage of test cases where all expected fields matched
- Intent accuracy: percentage of cases where the classified intent matched the expected intent
- Outcome accuracy: percentage of cases where the decision outcome matched the expected outcome
- Risk level accuracy: percentage of cases where the assigned risk level matched the expected risk level
- Diff view: for each non-matching case, a side-by-side comparison of the expected and actual governance result
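To make the relationship between these metrics concrete, the sketch below computes them from paired expected and actual results. It is illustrative only and is not VeriProof's implementation: the function name `summarize` and the keys `intent`, `outcome`, and `risk_level` are hypothetical, and it assumes every percentage is taken over all test cases in the run.

```python
FIELDS = ("intent", "outcome", "risk_level")  # hypothetical key names


def summarize(expected, actual):
    """Return per-field accuracy, match rate (both in percent), and diffs."""
    if not expected or len(expected) != len(actual):
        raise ValueError("expected and actual must be non-empty and equal in length")

    total = len(expected)
    field_hits = {name: 0 for name in FIELDS}
    full_matches = 0
    diffs = []

    for index, (exp, act) in enumerate(zip(expected, actual)):
        # Collect (expected, actual) pairs for every field that differs.
        mismatched = {n: (exp[n], act[n]) for n in FIELDS if exp[n] != act[n]}
        for name in FIELDS:
            if name not in mismatched:
                field_hits[name] += 1
        if mismatched:
            diffs.append({"case": index, "fields": mismatched})
        else:
            full_matches += 1  # a case counts only if all fields matched

    report = {f"{n}_accuracy": 100 * field_hits[n] / total for n in FIELDS}
    report["match_rate"] = 100 * full_matches / total
    report["diffs"] = diffs
    return report


# Invented example data: cases 1 and 2 each differ in one field.
expected = [
    {"intent": "refund", "outcome": "Success", "risk_level": "low"},
    {"intent": "transfer", "outcome": "Escalated", "risk_level": "high"},
    {"intent": "closure", "outcome": "Failure", "risk_level": "medium"},
    {"intent": "balance", "outcome": "Success", "risk_level": "low"},
]
actual = [
    {"intent": "refund", "outcome": "Success", "risk_level": "low"},
    {"intent": "transfer", "outcome": "Success", "risk_level": "high"},
    {"intent": "other", "outcome": "Failure", "risk_level": "medium"},
    {"intent": "balance", "outcome": "Success", "risk_level": "low"},
]
report = summarize(expected, actual)
```

In this example the match rate (50%) is lower than every per-field accuracy, because a case counts toward the match rate only when all three fields agree.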

A/B Endpoint Evaluation

A/B evaluation runs the same dataset against two different endpoints simultaneously and compares their governance outputs. This is the recommended workflow when evaluating a model upgrade or prompt change.

To set up an A/B evaluation:

1. Open a dataset and click New A/B Run.
2. Configure Arm A, typically your current production endpoint.
3. Configure Arm B, your new endpoint with the model upgrade or prompt change.
4. Click Start A/B Run.

The results show a side-by-side comparison of every governance metric — governance score distributions, risk level distributions, intent accuracy, and outcome accuracy — between Arm A and Arm B.
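One way to reason about a distribution comparison like this is sketched below for a single field. The function name `compare_arms` and the result keys are hypothetical, and the code shows the general idea rather than how VeriProof computes its comparison.

```python
from collections import Counter


def compare_arms(arm_a, arm_b, field):
    """Compare one field's distribution across two arms run on the same dataset."""
    if len(arm_a) != len(arm_b):
        raise ValueError("both arms must cover the same test cases")

    dist_a = Counter(result[field] for result in arm_a)
    dist_b = Counter(result[field] for result in arm_b)

    # Positive shift means Arm B produced the value more often than Arm A.
    values = sorted(set(dist_a) | set(dist_b))
    shift = {value: dist_b[value] - dist_a[value] for value in values}

    # Indices of test cases where the two arms disagree on this field.
    disagreements = [
        index
        for index, (a, b) in enumerate(zip(arm_a, arm_b))
        if a[field] != b[field]
    ]
    return {
        "arm_a": dict(dist_a),
        "arm_b": dict(dist_b),
        "shift": shift,
        "disagreements": disagreements,
    }


# Invented example data: Arm B downgrades test case 1 from high to medium.
arm_a = [{"risk_level": r} for r in ("low", "high", "medium", "low")]
arm_b = [{"risk_level": r} for r in ("low", "medium", "medium", "low")]
comparison = compare_arms(arm_a, arm_b, "risk_level")
```

Listing the disagreeing cases alongside the distribution shift matters because two arms can produce identical distributions while still disagreeing on individual inputs.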

Export A/B evaluation results as supporting evidence for model change documentation under EU AI Act Article 11 (technical documentation) and ISO/IEC 42001 Clause 8.4 (AI system impact assessment).

Managing Datasets

Datasets are versioned — each evaluation run is stored against the dataset version that was active at the time of the run. This preserves a complete record of how your governance posture evolved over successive model and prompt changes.

To update a dataset without losing historical run data, click New Version rather than editing in place.
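The versioning behavior described above can be pictured with a small model in which versions are immutable snapshots and each run records the version it was executed against. This is a conceptual sketch only: the class `VersionedDataset` and its methods are hypothetical and do not correspond to a VeriProof API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class VersionedDataset:
    """Hypothetical model: immutable version snapshots plus a run history."""
    name: str
    versions: List[Tuple[str, ...]] = field(default_factory=list)  # index 0 is version 1
    runs: List[Dict] = field(default_factory=list)

    def new_version(self, cases):
        """Append a snapshot of the cases; earlier versions stay untouched."""
        self.versions.append(tuple(cases))
        return len(self.versions)

    def record_run(self, match_rate):
        """Store a run against whichever version is active right now."""
        if not self.versions:
            raise ValueError("create a version before recording a run")
        run = {"version": len(self.versions), "match_rate": match_rate}
        self.runs.append(run)
        return run


# Invented example: a run on version 1, then a new version, then another run.
dataset = VersionedDataset("refund-regressions")
dataset.new_version(["case-1", "case-2"])
dataset.record_run(match_rate=100.0)
dataset.new_version(["case-1", "case-2", "case-3"])
dataset.record_run(match_rate=66.7)
```

Because the first run still points at version 1, its result remains interpretable after the dataset gains a third test case.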

- Playground: test individual prompts interactively against governance policies before adding them to a dataset.
- Sessions: find historical sessions to import as dataset test cases.
- SDK Instrumentation Guide: ensure your endpoints are instrumented correctly before running evaluations.