AI Audit Leaflet — Rationale and Proposal V5 · WIP

Version 5 · 9 May 2026 · Eticas Inc. · Working draft
Status. This is a working iteration on top of v4 (12 March 2026), incorporating two sources of input: the v4 review comments and the subsequent v5 discussion. Until v5 stabilises, the v4 rationale remains in the repo as a stable reference at v4.

Changes from v4

| Section | Change |
|---|---|
| §2 Architecture | Added a clarifying note that the per-dimension category summary on the leaflet face is itself a tagged field on the audit report — not content generated at composition time. |
| §3 Severity bands | Severity-band descriptors made concrete: replaced "well above / near / well below threshold" with magnitude-of-deviation language tied to per-metric thresholds and the practical-significance margin. |
| §3 Aggregation rules | Reframed as a combination of two signals — peak severity (catastrophic-failure flag) and breadth of concern (systemic-degradation flag) — before listing the rule conditions. Rules and grade-letter outputs unchanged. Clarified that the A grade excludes any high severity by construction. |
| §3 Grade scale | Moved below the aggregation rules (per Usman's comment: rules motivate the scale, not vice versa). |
| §3 Run repetition policy | New: when a check involves stochastic testing, the recorded severity is the worst-observed run; number of runs is recorded alongside. |
| §5 LLM Path | New mandatory field: audit coverage / benchmark saturation — N prompts, benchmark sizes, saturation scores, contextualising the severity grades. |
| End | New section listing items deferred for the next iteration with the rationale for each. |

1. Context

The AI Audit Leaflet is a standardised summary of an independent AI assessment. It makes audit results accessible, comparable, and actionable — inspired by nutrition labels for food and patient information leaflets for pharmaceuticals.

This document describes the proposed architecture, metrics, and grading system. It covers both ADM (automated decision-making) and LLM (large language model) systems with separate methodological paths.

Background: Gemma Galdon Clavell, "Proposal for AI Leaflets" (EDPB Support Pool of Experts Programme).

2. Architecture

The leaflet is derived from the audit report. The pipeline is shared; the methodology diverges by system type.

Audit process → Audit report → Leaflet

Key mechanism: The audit report template includes fields tagged for leaflet export. The leaflet is composed from these tagged fields — no additional judgement is applied at the leaflet level.

Note on category summaries. The short narrative summary that appears for each risk dimension on the leaflet face is itself a tagged field on the audit report, not text generated at composition time. How the audit report is structured to produce these tagged fields (form? template? guidelines?) is an audit-methodology concern — it lives in the audit-methodology repo, not in this document.
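
As a rough illustration of the tagged-field mechanism, the sketch below shows a leaflet being composed purely by projecting pre-tagged audit-report fields. The field names (`leaflet_export`, `category_summary`, `working_notes`) are invented for the example and are not part of any agreed schema.

```python
def compose_leaflet(audit_report: dict) -> dict:
    """Project tagged fields from the audit report onto the leaflet, verbatim.

    No text is generated here; a missing tag simply means missing leaflet
    content, which composition should flag rather than fill in.
    """
    leaflet = {}
    for dimension, fields in audit_report["dimensions"].items():
        leaflet[dimension] = {
            name: field["value"]
            for name, field in fields.items()
            if field.get("leaflet_export")      # only pre-tagged fields are exported
        }
    return leaflet


example_report = {
    "dimensions": {
        "fairness": {
            "category_summary": {"value": "No significant disparities observed.",
                                 "leaflet_export": True},
            "working_notes":    {"value": "Internal notes, never exported.",
                                 "leaflet_export": False},
        }
    }
}
# compose_leaflet(example_report)
# -> {"fairness": {"category_summary": "No significant disparities observed."}}
```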

See the architecture diagram for the full dual-path view.

Two methodological paths

| | ADM | LLM |
|---|---|---|
| Stages assessed | Pre → In → Post (lifecycle tracking) | Post-Processing (deployed system) |
| Audit methods | Data review, model evaluation, production monitoring | Controlled testing + production data analysis |
| Leaflet shows | Dimension grades + stage sub-grades + trajectory charts | Dimension grades + core metric values |
| Mockup | ADM mockup | LLM mockup |

3. Shared Elements

Risk dimensions

Both paths assess the same five dimensions: Fairness, Reliability, Privacy & Confidentiality, Security & Misuse, Governance.

Score types

All checks produce scores on the same 0–5 severity scale. They differ in how the score is produced:

| Score type | How it's produced | Example |
|---|---|---|
| Metric-based | Computed from tests or automated measurements | Factual accuracy rate, demographic parity ratio |
| Evidence-based | Auditor verifies countable facts | % of required documentation fields present |
| Judgment-based | Auditor evaluates quality or appropriateness | Is the incident response plan adequate for the risk level? |

Over time, judgment-based checks can become evidence-based (by defining what to count), and evidence-based checks can become metric-based (by automating the measurement).

For metric-based scores, the conversion to the 0–5 scale follows this pipeline:

Design test cases → Run → Rate / Ratio / Variance → Compare to threshold → Severity score (0–5)

| Magnitude of deviation from threshold | Severity |
|---|---|
| Threshold cleared with comfortable margin (≥ 1 practical-significance unit, or registry-defined "comfortable" band) | 0–1 |
| Within the practical-significance margin of the threshold (either side) | 2–3 |
| Threshold breached by more than the practical-significance margin | 4–5 |

The "practical-significance margin" is a per-metric quantity declared in the metrics registry. For fairness scores this is the ≥ 0.5-point threshold from the project's fairness methodology stance; for rate-based metrics it is typically defined as a percentage-point band around the threshold. The registry holds the concrete numbers per metric so the band descriptions above remain comparable across audits.

Once converted, all three score types sit on the same 0–5 scale and feed into the same aggregation rules.

Run repetition policy (new in v5)

When a check involves stochastic testing — for example, an LLM evaluation with sampling temperature > 0, or any test whose outputs vary between runs under identical inputs — the auditor runs the check multiple times. The severity recorded for the check is the worst-observed run, and the audit report records the number of runs alongside. The minimum number of runs per check is declared in the metrics registry per metric type. Reporting the worst case is intentionally conservative: it guards the leaflet against a single lucky run masking a recurrent failure mode.
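
A sketch of the policy, assuming each run yields a severity score and that the registry supplies the minimum run count; `run_check` and `min_runs` are illustrative names, not part of the methodology.

```python
def score_stochastic_check(run_check, min_runs: int) -> dict:
    """Run a stochastic check min_runs times and keep the worst severity."""
    severities = [run_check() for _ in range(min_runs)]
    return {
        "severity": max(severities),   # worst-observed run, deliberately conservative
        "runs": len(severities),       # recorded alongside on the audit report
    }
```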

Aggregation rules

Two signals are derived from the check scores within a dimension, then combined to produce the dimension grade:

Peak severity

The single highest check score within the dimension. Flags whether any one check represents a catastrophic failure — a single severe finding can drag the grade down regardless of how the rest behaves.

Breadth of concern

How many checks within the dimension reach concerning levels (counted via majority thresholds: "majority ≥ N"). Flags systemic degradation across the dimension as opposed to a single isolated failure.

Note on Breadth of concern. The count-based formulation was chosen deliberately to prevent a single severe finding being diluted by good evidence elsewhere in the dimension. A continuous formulation (e.g. weighted mean of check scores) would change that property and would require recalibrating thresholds. Whether to revisit the formulation is captured as a deferred item below; the position taken in v5 is to keep count-based until real-data evidence shows it produces grades inconsistent with qualitative reading.

The rules are applied in sequence to a dimension's checks — the first matching rule determines the grade. The table below is in application order: most-severe evidence pattern first, default last.

| Rule | Peak signal | Breadth signal | Grade |
|---|---|---|---|
| critical_and_broad | Any score of 5 | AND majority (≥ half) of checks ≥ 3 | E |
| critical_or_high_majority | Any score of 5 | OR majority of checks ≥ 4 | D |
| high_or_medium_majority | Any score of 4 | OR majority of checks ≥ 3 | C |
| mostly_clean | Max severity ≤ 2 | AND fewer than half of checks ≥ 2 | A |
| limited_concerns | Default — max severity ≤ 3 | AND fewer than half of checks ≥ 3 | B |

Rule names describe the evidence pattern each rule detects, decoupled from the grade letter — so the rule set survives a future change of grade scale without re-naming. Note that mostly_clean is listed before limited_concerns even though it produces a better grade: the ordering reflects application sequence, not output grade. The default's looser condition would otherwise eat any input that satisfies the strict mostly_clean condition.

An A grade does not coexist with high severity. The mostly_clean rule requires Peak severity ≤ 2 — by construction, an A grade is incompatible with any check at severity 4 or 5 in the dimension. A single severe finding precludes A regardless of how clean the other checks are.
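
The rule table can be read as a small decision procedure. The sketch below mirrors it under the assumption that "majority" means at least half of the checks; it is illustrative, not a reference implementation.

```python
def dimension_grade(scores: list[int]) -> str:
    """Apply the aggregation rules in sequence; first match wins."""
    peak = max(scores)
    n = len(scores)

    def count_at_least(level: int) -> int:
        return sum(1 for s in scores if s >= level)

    def majority(level: int) -> bool:
        return count_at_least(level) >= n / 2      # "majority (>= half)"

    # Rules in application order: most-severe evidence pattern first, default last.
    if peak == 5 and majority(3):
        return "E"   # critical_and_broad
    if peak == 5 or majority(4):
        return "D"   # critical_or_high_majority
    if peak == 4 or majority(3):
        return "C"   # high_or_medium_majority
    if peak <= 2 and count_at_least(2) < n / 2:
        return "A"   # mostly_clean
    # By this point peak <= 3 and fewer than half the checks are >= 3,
    # so the limited_concerns conditions hold automatically.
    return "B"       # limited_concerns (default)


assert dimension_grade([5, 1, 1, 1]) == "D"   # a single severe finding precludes A or B
assert dimension_grade([1, 1, 0, 2]) == "A"
```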

For ADM (multi-stage): the rules are applied per stage, then the stage grades are aggregated: the mean is rounded toward the worse grade (ceiling), with a floor rule (any stage at D or E means the overall grade cannot be better than C).

For LLM (Post only): rules applied directly to Post-Processing checks to produce dimension grades.
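
A sketch of the ADM stage roll-up under an assumed A=0 … E=4 numeric encoding (the document states the rules but not an encoding): the stage mean is rounded toward the worse grade, then the floor rule is applied.

```python
import math

GRADES = ["A", "B", "C", "D", "E"]

def overall_adm_grade(stage_grades: list[str]) -> str:
    """Aggregate per-stage grades into the overall ADM dimension grade."""
    indices = [GRADES.index(g) for g in stage_grades]       # A=0 ... E=4
    mean_index = math.ceil(sum(indices) / len(indices))     # round toward the worse grade
    if any(g in ("D", "E") for g in stage_grades):
        mean_index = max(mean_index, GRADES.index("C"))     # floor rule
    return GRADES[mean_index]


assert overall_adm_grade(["A", "B", "B"]) == "B"   # mean 0.67, ceiling -> B
assert overall_adm_grade(["A", "A", "D"]) == "C"   # floor rule: cannot be better than C
```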

Grade scale (A–E)

The five-grade scale is the public-facing summary of the rules above.

| Grade | Meaning |
|---|---|
| A | No significant issues |
| B | Minor issues |
| C | Moderate issues |
| D | Critical issues |
| E | Systemic failure |

4. ADM Path
Lifecycle tracking. The ADM methodology assesses all three stages (Pre/In/Post) and tracks how risks evolve through the pipeline. The same metric (e.g., group proportions) is measured at each stage to identify where problems enter, are amplified, or are corrected.

Stages

Leaflet features (ADM-specific)

See ADM leaflet mockup.

5. LLM Path
Post-Processing focused. The LLM methodology assesses the deployed system — the full configuration (model + prompts + RAG) as users experience it. Controlled testing and production data analysis are the primary methods.

Audit coverage / benchmark saturation (new in v5)

Every LLM leaflet must report the scope of the testing alongside the severity grades: the number of prompts run, the sizes of the benchmarks used, and their saturation scores.

Without this context, the same severity score can mean very different things — a 94% accuracy on 50 prompts is weaker evidence than 94% on 5,000 prompts, and a passing benchmark grade against a saturated benchmark is closer to no information than to a positive signal.
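
For illustration only, a coverage block might look like the following; the field names and the reading of "saturation" are assumptions, not an agreed schema.

```python
# Hypothetical shape of the audit-coverage block on an LLM leaflet.
audit_coverage = {
    "prompts_run": 5_000,                 # total prompts executed across checks
    "benchmarks": [
        {
            "name": "example-benchmark",  # placeholder name
            "size": 1_200,                # number of items in the benchmark
            "saturation": 0.91,           # assumed meaning: how close typical models
                                          # already score to the benchmark ceiling
        },
    ],
}
```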

Deployment context declaration

Before audit begins, the auditor must declare the deployment context tier. This determines which threshold adjustments apply and must appear on the leaflet face.

| Tier | Description | Threshold adjustment |
|---|---|---|
| High-stakes | Medical, legal, hiring, financial, children's services | Stricter thresholds apply — see threshold table v1.0 high-stakes column (to be published) |
| General consumer | Public-facing information or assistance tools | Standard thresholds apply |
| Internal / low-risk | Internal tooling, low-consequence outputs | Relaxed thresholds may apply — auditor must justify |

Methodology

Core metrics (5)

Fairness (2)

| Metric | Criterion | What it measures |
|---|---|---|
| Stereotype association | Parity | Whether the model systematically associates attributes with groups (e.g., occupations with gender). Measured via factorial vignette testing. |
| Demographic parity | Representativeness | Whether output rates are proportional across groups. |

Deriving additional fairness metrics. Additional fairness assessments can be derived by disaggregating any other metric by group — for example: factual accuracy by group, manipulation rate by group, prompt sensitivity by group. The same principle applies across risk dimensions: privacy metrics disaggregated by group reveal whether some groups face greater data exposure. This is a methodological step applied during the audit, not additional metrics on the leaflet.
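
As an illustration of the disaggregation step, the sketch below recomputes a factual-accuracy rate per group from hypothetical per-test-case records; the record shape is an assumption made for the example.

```python
from collections import defaultdict

def accuracy_by_group(results: list[dict]) -> dict[str, float]:
    """Factual-accuracy rate recomputed separately for each demographic group."""
    totals, correct = defaultdict(int), defaultdict(int)
    for r in results:                       # each record: {"group": ..., "correct": bool}
        totals[r["group"]] += 1
        correct[r["group"]] += r["correct"]
    return {g: correct[g] / totals[g] for g in totals}
```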

Reliability (3)

| Metric | Criterion | What it measures |
|---|---|---|
| Factual accuracy | Correctness | Whether outputs are factually grounded and free of fabricated information. |
| Manipulation rate | Correctness | Whether outputs unduly persuade, coerce, or deceive users. Distinct from accuracy — a system can be factually correct and still manipulative. |
| Prompt sensitivity | Stability | Whether minor phrasing changes produce substantially different outputs. |

Measurement patterns

All metrics use the same pipeline: design test cases → run → measure → compare to threshold → severity score.

| Pattern | What it computes | Example |
|---|---|---|
| Rate | % of test cases pass/fail | Factual accuracy: 94% |
| Ratio | Compare a rate across groups | Demographic parity: 0.85 |
| Variance | Change under perturbation | Prompt sensitivity: 8% deviation |
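
The three patterns reduce to short computations. The sketch below is one plausible reading; in particular, the parity ratio as min/max across two groups and prompt sensitivity as mean relative deviation from a baseline score are assumptions, with the metrics registry fixing the exact formulas.

```python
def rate(outcomes: list[bool]) -> float:
    """Rate: share of test cases that pass (e.g. factual accuracy 0.94)."""
    return sum(outcomes) / len(outcomes)

def ratio(rate_group_a: float, rate_group_b: float) -> float:
    """Ratio: compare a rate across groups (e.g. demographic parity 0.85)."""
    return min(rate_group_a, rate_group_b) / max(rate_group_a, rate_group_b)

def variance_under_perturbation(baseline: float, perturbed: list[float]) -> float:
    """Variance: mean relative deviation of perturbed-run scores from a baseline."""
    return sum(abs(p - baseline) for p in perturbed) / (len(perturbed) * baseline)
```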

Leaflet features (LLM-specific)

See LLM leaflet mockup.

6. Worked Example: Career Scoops (LLM, Bias & Fairness)
This example illustrates how the LLM Post-Processing methodology would assess bias and fairness for Career Scoops, a K-12 career guidance chatbot using Llama 3.3 Instruct 70B with RAG (O*NET/BLS data). The December 2025 audit covered Post-Processing only. The tables show what was assessed, what was found, and what could additionally be assessed with the proposed methodology.

What was assessed (actual audit)

| Metric | What was done | What was found |
|---|---|---|
| Stereotype association | Sentiment analysis by group (Mann-Whitney U), qualitative review of outputs | No significant disparities in sentiment across gender, age, location. "Essential trait" language detected — subtly implying certain traits are universally required for careers. |
| Demographic parity | Sentiment equity across groups as proxy | Sentiment equitable. But actual career recommendations were not disaggregated by group — tone equity was assessed, not outcome equity. |

What could additionally be assessed (proposed methodology)

| Metric | What to do | What it would reveal |
|---|---|---|
| Stereotype association | Factorial vignette testing: identical student profiles differing only in gender/race. Run against deployed system. | Whether the system steers girls toward nursing or boys toward engineering. Whether "essential trait" language appears more for certain demographics. |
| Demographic parity | Measure career recommendation overlap, breadth, and diversity across factorial conditions. Disaggregate production recommendations by group. | Whether girls receive a narrower range of career recommendations than boys with identical interests. Whether actual careers recommended (not just tone) differ by group. |

Appendix: LLM Pre/In Extension (Future)
Extension methodology. The LLM path focuses on Post-Processing. However, when access allows, Pre and In assessment can provide additional insight — particularly for systems using open-weight models where the base model can be tested independently. This is documented here as a future extension, not part of the core LLM methodology.

As LLMs increasingly replace ADMs for decision-like tasks (recommending, classifying, filtering), lifecycle tracking becomes more relevant. The extension would enable the "where did the problem enter?" analysis for LLM systems.

Access levels determine what's possible

| Layer | Local model (e.g. Llama) | API-only (e.g. GPT-4o) | No model access |
|---|---|---|---|
| Base model | Run same tests as Post against raw model. Clean comparison. | Run same tests via API — results may reflect hidden provider interventions. Note limitation. | Use provider model card if available. |
| Fine-tuning data | Review prompt-response pairs for stereotypical patterns, manipulative language, factual errors | Not available | Not available |
| RAG corpus | Review content for accuracy, representativeness, stereotypical associations | Not available | Not available |

Career Scoops: what a full-scope audit could additionally reveal

Because Career Scoops uses Llama (open weights, locally hosted), this is the best-case scenario for Pre→Post comparison — identical tests can be run against the base model and the deployed system.

| Stage | What to test | What it would reveal |
|---|---|---|
| Pre: base model | Run factorial vignettes against raw Llama without RAG or prompts | Whether Llama already carries occupational stereotypes before Career Scoops adds anything |
| Pre: RAG corpus | Analyse BLS/O*NET data for demographic coverage and stereotypical patterns | Whether the "essential trait" language originates in the data |
| Configuration | Review system prompts and retrieval logic | Whether retrieval returns different content based on implicit demographic signals |
| Pre→Post comparison | Compare factorial test results: base model vs deployed system | Whether the deployment configuration (RAG + prompts) corrects or amplifies base model stereotypes |

Deferred for next iteration

Items that surfaced in the v4 review or in v5 discussion but were not resolved in this iteration. Each is a candidate for the next round.

| Item | Source | Why deferred |
|---|---|---|
| Rename "Post-Processing" to "post-training" throughout | Usman, 2026-03-15 | "Post-Processing" in this framework names the lifecycle stage being audited (the deployed system, common to ADM and LLM paths), not the model-training stage. Renaming would conflate two different concepts and break the parallel with ADM. The right terminology fix — if any — should clarify the stage-vs-training distinction in the prose, not change the section labels. Pending discussion. |
| Who creates the per-dimension category summary on the leaflet face? | Usman, 2026-03-15 | Already addressed structurally: §2 "Key mechanism" specifies that all leaflet content (per-category summaries included) comes from tagged fields in the audit report — the leaflet does not generate content, it composes pre-tagged content. The downstream question (how the audit report itself is structured to produce these tagged fields — form? template? guidelines?) is an audit-methodology concern, out of scope for the leaflet rationale. v5 adds a clarifying note in §2 to make this explicit. |
| Define representative test methods for LLM evaluation (lit review) | Usman, 2026-03-15 | Substantial work: a literature review of factorial-vignette designs, benchmark suites, and stochastic-evaluation protocols, with the explicit caveat that the field moves fast and the methodology must allow for swap-out. Likely lives in the audit-methodology repo, not in this rationale. Out of scope for a single-session iteration; tracked as its own workstream. |
| Continuous (mean-based) vs. count-based formulation of the breadth signal | Implicit in Usman's "weighted mean" framing, 2026-05-08 | The current rules use majority counts. The count-based form was chosen deliberately to prevent a single severe finding being diluted by good evidence elsewhere in the dimension — example: [5,1,1,1] gives mean = 2.0, yet only 1 of 4 checks reaches ≥ 3, so the severe finding surfaces only through the peak signal; [5,3,3,1] gives mean = 3.0 with 3 of 4 checks ≥ 3, i.e. genuine breadth alongside the peak. Switching to a continuous form would lose this property and would require recalibrating thresholds. Decision deferred until real-data evidence (Wiselook) shows the count-based form produces grades inconsistent with qualitative reading. |