AI Audit Leaflet — Rationale and Proposal V5 · WIP

Version 5 · 9 May 2026 · Eticas Inc. · Working draft

Status. This is a working iteration on top of v4 (12 March 2026), incorporating two sources of input:

Review comments from Usman Gohar on the v4 SharePoint document (batch dated 2026-05-08).
Implementation decisions surfaced while operationalising the package against the synthetic Wiselook fixture — recorded chronologically in meta/INTERNAL_LOG.md for anyone wanting to trace the decision history.

Until v5 stabilises, the v4 rationale remains in the repo as a stable reference at v4.

Changes from v4
1. Context
2. Architecture
3. Shared Elements — score types, severity bands, run repetition policy, aggregation rules, grade scale
4. ADM Path
5. LLM Path — audit coverage, deployment context, methodology, core metrics
6. Worked Example: Career Scoops
Appendix: LLM Pre/In Extension
Deferred for next iteration

Changes from v4

Section	Change
§2 Architecture	Added a clarifying note that the per-dimension category summary on the leaflet face is itself a tagged field on the audit report — not content generated at composition time.
§3 Severity bands	Severity-band descriptors made concrete: replaced "well above / near / well below threshold" with magnitude-of-deviation language tied to per-metric thresholds and the practical-significance margin.
§3 Aggregation rules	Reframed as a combination of two signals — peak severity (catastrophic-failure flag) and breadth of concern (systemic-degradation flag) — before listing the rule conditions. Rules and grade-letter outputs unchanged. Clarified that the A grade excludes any high severity by construction.
§3 Grade scale	Moved below the aggregation rules (per Usman's comment: rules motivate the scale, not vice versa).
§3 Run repetition policy	New: when a check involves stochastic testing, the recorded severity is the worst-observed run; number of runs is recorded alongside.
§5 LLM Path	New mandatory field: audit coverage / benchmark saturation — N prompts, benchmark sizes, saturation scores, contextualising the severity grades.
End	New section listing items deferred for the next iteration with the rationale for each.

1. Context

The AI Audit Leaflet is a standardised summary of an independent AI assessment. It makes audit results accessible, comparable, and actionable — inspired by nutrition labels for food and patient information leaflets for pharmaceuticals.

This document describes the proposed architecture, metrics, and grading system. It covers both ADM (automated decision-making) and LLM (large language model) systems with separate methodological paths.

Background: Gemma Galdon Clavell, "Proposal for AI Leaflets" (EDPB Support Pool of Experts Programme).

2. Architecture

The leaflet is derived from the audit report. The pipeline is shared; the methodology diverges by system type.

Audit process

→

Audit report

→

Leaflet

Key mechanism: The audit report template includes fields tagged for leaflet export. The leaflet is composed from these tagged fields — no additional judgement is applied at the leaflet level.

Note on category summaries. The short narrative summary that appears for each risk dimension on the leaflet face is itself a tagged field on the audit report, not text generated at composition time. How the audit report is structured to produce these tagged fields (form? template? guidelines?) is an audit-methodology concern — it lives in the audit-methodology repo, not in this document.

See architecture diagram for the full dual-path diagram.

Two methodological paths

	ADM	LLM
Stages assessed	Pre → In → Post (lifecycle tracking)	Post-Processing (deployed system)
Audit methods	Data review, model evaluation, production monitoring	Controlled testing + production data analysis
Leaflet shows	Dimension grades + stage sub-grades + trajectory charts	Dimension grades + core metric values
Mockup	ADM mockup	LLM mockup

3. Shared Elements

Risk dimensions

Both paths assess the same five dimensions: Fairness, Reliability, Privacy & Confidentiality, Security & Misuse, Governance.

Score types

All checks produce scores on the same 0–5 severity scale. They differ in how the score is produced:

Score type	How it's produced	Example
Metric-based	Computed from tests or automated measurements	Factual accuracy rate, demographic parity ratio
Evidence-based	Auditor verifies countable facts	% of required documentation fields present
Judgment-based	Auditor evaluates quality or appropriateness	Is the incident response plan adequate for the risk level?

Over time, judgment-based checks can become evidence-based (by defining what to count), and evidence-based checks can become metric-based (by automating the measurement).

For metric-based scores, the conversion to the 0–5 scale follows this pipeline:

Design test cases

→

Run

→

Rate / Ratio / Variance

→

Compare to threshold

→

Severity score (0–5)

Magnitude of deviation from threshold	Severity
Threshold cleared with comfortable margin (≥ 1 practical-significance unit, or registry-defined "comfortable" band)	0–1
Within the practical-significance margin of the threshold (either side)	2–3
Threshold breached by more than the practical-significance margin	4–5

The "practical-significance margin" is a per-metric quantity declared in the metrics registry. For fairness scores this is the ≥ 0.5-point threshold from the project's fairness methodology stance; for rate-based metrics it is typically defined as a percentage-point band around the threshold. The registry holds the concrete numbers per metric so the band descriptions above remain comparable across audits.

Once converted, all three score types sit on the same 0–5 scale and feed into the same aggregation rules.

Run repetition policy new in v5

When a check involves stochastic testing — for example, an LLM evaluation with sampling temperature > 0, or any test whose outputs vary between runs under identical inputs — the auditor runs the check multiple times. The severity recorded for the check is the worst-observed run, and the audit report records the number of runs alongside. The minimum number of runs per check is declared in the metrics registry per metric type. Reporting the worst case is intentionally conservative: it guards the leaflet against a single lucky run masking a recurrent failure mode.

Aggregation rules

Two signals are derived from the check scores within a dimension, then combined to produce the dimension grade:

Peak severity

The single highest check score within the dimension. Flags whether any one check represents a catastrophic failure — a single severe finding can drag the grade down regardless of how the rest behaves.

Breadth of concern

How many checks within the dimension reach concerning levels (counted via majority thresholds: "majority ≥ N"). Flags systemic degradation across the dimension as opposed to a single isolated failure.

Note on Breadth of concern. The count-based formulation was chosen deliberately to prevent a single severe finding being diluted by good evidence elsewhere in the dimension. A continuous formulation (e.g. weighted mean of check scores) would change that property and would require recalibrating thresholds. Whether to revisit the formulation is captured as a deferred item below; the position taken in v5 is to keep count-based until real-data evidence shows it produces grades inconsistent with qualitative reading.

The rules are applied to checks in sequence — the first matching rule determines the grade. The table below is in application order: most-severe evidence pattern first, default last.

Rule	Peak signal	Breadth signal	Grade
`critical_and_broad`	Any score of 5	AND majority (≥ half) of checks ≥ 3	E
`critical_or_high_majority`	Any score of 5	OR majority of checks ≥ 4	D
`high_or_medium_majority`	Any score of 4	OR majority of checks ≥ 3	C
`mostly_clean`	Max severity ≤ 2	AND fewer than half of checks ≥ 2	A
`limited_concerns`	Default — max severity ≤ 3	AND fewer than half of checks ≥ 3	B

Rule names describe the evidence pattern each rule detects, decoupled from the grade letter — so the rule set survives a future change of grade scale without re-naming. Note that mostly_clean is listed before limited_concerns even though it produces a better grade: the ordering reflects application sequence, not output grade. The default's looser condition would otherwise eat any input that satisfies the strict mostly_clean condition.

A grade does not coexist with high severity. The mostly_clean rule requires Peak severity ≤ 2 — by construction, an A grade is incompatible with any check at severity 4 or 5 in the dimension. A single severe finding precludes A regardless of how clean the other checks are.

For ADM (multi-stage): rules applied per stage, then stage grades aggregated: mean rounded toward worse grade (ceiling), floor rule (any stage D or E → overall can't be better than C).

For LLM (Post only): rules applied directly to Post-Processing checks to produce dimension grades.

Grade scale (A–E)

The five-grade scale is the public-facing summary of the rules above.

No significant
issues

Minor
issues

Moderate
issues

Critical
issues

Systemic
failure

4. ADM Path ADM

Lifecycle tracking. The ADM methodology assesses all three stages (Pre/In/Post) and tracks how risks evolve through the pipeline. The same metric (e.g., group proportions) is measured at each stage to identify where problems enter, are amplified, or are corrected.

Stages

Pre-Processing: training data, population representativeness, labelling, fairness criteria
In-Processing: model evaluation, fairness metrics, threshold analysis, performance by group
Post-Processing: production outcomes, monitoring, complaint patterns, HITL override analysis

Leaflet features (ADM-specific)

Stage sub-grades (Pre / In / Post) within each dimension
Trajectory charts showing representative metric evolution across stages

See ADM leaflet mockup.

5. LLM Path LLM

Post-Processing focused. The LLM methodology assesses the deployed system — the full configuration (model + prompts + RAG) as users experience it. Controlled testing and production data analysis are the primary methods.

Audit coverage / benchmark saturation new in v5

Every LLM leaflet must report the scope of the testing alongside the severity grades:

Number of test cases run per metric (e.g., factual-accuracy probes, factorial vignettes for fairness).
Benchmark sizes for any third-party benchmark used (TruthfulQA, MMLU, BBQ, HELM, etc.) — both the canonical size and the subset actually run, when sub-sampling.
Saturation against published benchmarks, where applicable: percentage of the canonical set covered, and a note when the benchmark is known to be near-saturated for the model class (so a high pass rate carries less signal).
Number of runs per stochastic check, per the run repetition policy in §3.

Without this context, the same severity score can mean very different things — a 94% accuracy on 50 prompts is weaker evidence than 94% on 5,000 prompts, and a passing benchmark grade against a saturated benchmark is closer to no information than to a positive signal.

Deployment context declaration

Before audit begins, the auditor must declare the deployment context tier. This determines which threshold adjustments apply and must appear on the leaflet face.

Tier	Description	Threshold adjustment
High-stakes	Medical, legal, hiring, financial, children's services	Stricter thresholds apply — see threshold table v1.0 high-stakes column (to be published)
General consumer	Public-facing information or assistance tools	Standard thresholds apply
Internal / low-risk	Internal tooling, low-consequence outputs	Relaxed thresholds may apply — auditor must justify

Methodology

Design domain-specific test cases (factorial vignettes for fairness, benchmark sets for reliability)
Run against the deployed system
Analyse production logs for real-world patterns
Measure core metrics with defined thresholds

Core metrics (5)

Fairness (2)

Metric	Criterion	What it measures
Stereotype association	Parity	Whether the model systematically associates attributes with groups (e.g., occupations with gender). Measured via factorial vignette testing.
Demographic parity	Representativeness	Whether output rates are proportional across groups.

Deriving additional fairness metrics. Additional fairness assessments can be derived by disaggregating any other metric by group — for example: factual accuracy by group, manipulation rate by group, prompt sensitivity by group. The same principle applies across risk dimensions: privacy metrics disaggregated by group reveal whether some groups face greater data exposure. This is a methodological step applied during the audit, not additional metrics on the leaflet.

Reliability (3)

Metric	Criterion	What it measures
Factual accuracy	Correctness	Whether outputs are factually grounded and free of fabricated information.
Manipulation rate	Correctness	Whether outputs unduly persuade, coerce, or deceive users. Distinct from accuracy — a system can be factually correct and still manipulative.
Prompt sensitivity	Stability	Whether minor phrasing changes produce substantially different outputs.

Measurement patterns

All metrics use the same pipeline: design test cases → run → measure → compare to threshold → severity score.

Pattern	What it computes	Example
Rate	% of test cases pass/fail	Factual accuracy: 94%
Ratio	Compare a rate across groups	Demographic parity: 0.85
Variance	Change under perturbation	Prompt sensitivity: 8% deviation

Leaflet features (LLM-specific)

Dimension grades (A–E) — no stage sub-grades
Core metric values displayed as metric cards
Audit coverage / benchmark saturation panel (see above)

See LLM leaflet mockup.

6. Worked Example: Career Scoops (LLM, Bias & Fairness)

This example illustrates how the LLM Post-Processing methodology would assess bias and fairness for Career Scoops, a K-12 career guidance chatbot using Llama 3.3 Instruct 70B with RAG (O*NET/BLS data). The December 2025 audit covered Post-Processing only. The tables show what was assessed, what was found, and what could additionally be assessed with the proposed methodology.

What was assessed (actual audit)

Metric	What was done	What was found
Stereotype association	Sentiment analysis by group (Mann-Whitney U), qualitative review of outputs	No significant disparities in sentiment across gender, age, location. "Essential trait" language detected — subtly implying certain traits are universally required for careers.
Demographic parity	Sentiment equity across groups as proxy	Sentiment equitable. But actual career recommendations were not disaggregated by group — tone equity was assessed, not outcome equity.

What could additionally be assessed (proposed methodology)

Metric	What to do	What it would reveal
Stereotype association	Factorial vignette testing: identical student profiles differing only in gender/race. Run against deployed system.	Whether the system steers girls toward nursing or boys toward engineering. Whether "essential trait" language appears more for certain demographics.
Demographic parity	Measure career recommendation overlap, breadth, and diversity across factorial conditions. Disaggregate production recommendations by group.	Whether girls receive a narrower range of career recommendations than boys with identical interests. Whether actual careers recommended (not just tone) differ by group.

Appendix: LLM Pre/In Extension (Future)

Extension methodology. The LLM path focuses on Post-Processing. However, when access allows, Pre and In assessment can provide additional insight — particularly for systems using open-weight models where the base model can be tested independently. This is documented here as a future extension, not part of the core LLM methodology.

As LLMs increasingly replace ADMs for decision-like tasks (recommending, classifying, filtering), lifecycle tracking becomes more relevant. The extension would enable the "where did the problem enter?" analysis for LLM systems.

Access levels determine what's possible

Layer	Local model (e.g. Llama)	API-only (e.g. GPT-4o)	No model access
Base model	Run same tests as Post against raw model. Clean comparison.	Run same tests via API — results may reflect hidden provider interventions. Note limitation.	Use provider model card if available.
Fine-tuning data	Review prompt-response pairs for stereotypical patterns, manipulative language, factual errors		Not available
RAG corpus	Review content for accuracy, representativeness, stereotypical associations		Not available

Career Scoops: what a full-scope audit could additionally reveal

Because Career Scoops uses Llama (open weights, locally hosted), this is the best-case scenario for Pre→Post comparison — identical tests can be run against the base model and the deployed system.

Stage	What to test	What it would reveal
Pre: base model	Run factorial vignettes against raw Llama without RAG or prompts	Whether Llama already carries occupational stereotypes before Career Scoops adds anything
Pre: RAG corpus	Analyse BLS/O*NET data for demographic coverage and stereotypical patterns	Whether the "essential trait" language originates in the data
Configuration	Review system prompts and retrieval logic	Whether retrieval returns different content based on implicit demographic signals
Pre→Post comparison	Compare factorial test results: base model vs deployed system	Whether the deployment configuration (RAG + prompts) corrects or amplifies base model stereotypes

Deferred for next iteration

Items that surfaced in the v4 review or in v5 discussion but were not resolved in this iteration. Each is a candidate for the next round.

Item	Source	Why deferred
Rename "Post-Processing" to "post-training" throughout	Usman, 2026-03-15	"Post-Processing" in this framework names the lifecycle stage being audited (the deployed system, common to ADM and LLM paths), not the model-training stage. Renaming would conflate two different concepts and break the parallel with ADM. The right terminology fix — if any — should clarify the stage-vs-training distinction in the prose, not change the section labels. Pending discussion.
Who creates the per-dimension category summary on the leaflet face?	Usman, 2026-03-15	Already addressed structurally: §2 "Key mechanism" specifies that all leaflet content (per-category summaries included) comes from tagged fields in the audit report — the leaflet does not generate content, it composes pre-tagged content. The downstream question (how the audit report itself is structured to produce these tagged fields — form? template? guidelines?) is an audit-methodology concern, out of scope for the leaflet rationale. v5 adds a clarifying note in §2 to make this explicit.
Define representative test methods for LLM evaluation (lit review)	Usman, 2026-03-15	Substantial work: a literature review of factorial-vignette designs, benchmark suites, and stochastic-evaluation protocols, with the explicit caveat that the field moves fast and the methodology must allow for swap-out. Likely lives in the audit-methodology repo, not in this rationale. Out of scope for a single-session iteration; tracked as its own workstream.
Continuous (mean-based) vs. count-based formulation of the breadth signal	Implicit in Usman's "weighted mean" framing, 2026-05-08	The current rules use majority counts. The count-based form was chosen deliberately to prevent a single severe finding being diluted by good evidence elsewhere in the dimension — example: `[5,1,1,1]` gives mean = 1.4 but majority ≥ 3 = 0 of 4 (a single severe finding diluted under mean); `[5,3,3,1]` gives mean = 3.0 but majority ≥ 3 = 2 of 4 (count-based requires explicit breadth alongside the peak). Switching to a continuous form would lose this property and would require recalibrating thresholds. Decision deferred until real-data evidence (Wiselook) shows the count-based form produces grades inconsistent with qualitative reading.

AI Audit Leaflet — Rationale and Proposal V5 · WIP

Contents

Changes from v4

1. Context

2. Architecture

Two methodological paths

3. Shared Elements

Risk dimensions

Score types

Run repetition policy new in v5

Aggregation rules

Peak severity

Breadth of concern

Grade scale (A–E)

Stages

Leaflet features (ADM-specific)

Audit coverage / benchmark saturation new in v5

Deployment context declaration

Methodology

Core metrics (5)

Measurement patterns

Leaflet features (LLM-specific)

What was assessed (actual audit)

What could additionally be assessed (proposed methodology)

Access levels determine what's possible

Career Scoops: what a full-scope audit could additionally reveal