meta/INTERNAL_LOG.md for anyone wanting to trace the decision history.| Section | Change |
|---|---|
| §2 Architecture | Added a clarifying note that the per-dimension category summary on the leaflet face is itself a tagged field on the audit report — not content generated at composition time. |
| §3 Severity bands | Severity-band descriptors made concrete: replaced "well above / near / well below threshold" with magnitude-of-deviation language tied to per-metric thresholds and the practical-significance margin. |
| §3 Aggregation rules | Reframed as a combination of two signals — peak severity (catastrophic-failure flag) and breadth of concern (systemic-degradation flag) — before listing the rule conditions. Rules and grade-letter outputs unchanged. Clarified that the A grade excludes any high severity by construction. |
| §3 Grade scale | Moved below the aggregation rules (per Usman's comment: rules motivate the scale, not vice versa). |
| §3 Run repetition policy | New: when a check involves stochastic testing, the recorded severity is the worst-observed run; number of runs is recorded alongside. |
| §5 LLM Path | New mandatory field: audit coverage / benchmark saturation — N prompts, benchmark sizes, saturation scores, contextualising the severity grades. |
| End | New section listing items deferred for the next iteration with the rationale for each. |
The AI Audit Leaflet is a standardised summary of an independent AI assessment. It makes audit results accessible, comparable, and actionable — inspired by nutrition labels for food and patient information leaflets for pharmaceuticals.
This document describes the proposed architecture, metrics, and grading system. It covers both ADM (automated decision-making) and LLM (large language model) systems with separate methodological paths.
Background: Gemma Galdon Clavell, "Proposal for AI Leaflets" (EDPB Support Pool of Experts Programme).
The leaflet is derived from the audit report. The pipeline is shared; the methodology diverges by system type.
Note on category summaries. The short narrative summary that appears for each risk dimension on the leaflet face is itself a tagged field on the audit report, not text generated at composition time. How the audit report is structured to produce these tagged fields (form? template? guidelines?) is an audit-methodology concern — it lives in the audit-methodology repo, not in this document.
See architecture diagram for the full dual-path diagram.
| ADM | LLM | |
|---|---|---|
| Stages assessed | Pre → In → Post (lifecycle tracking) | Post-Processing (deployed system) |
| Audit methods | Data review, model evaluation, production monitoring | Controlled testing + production data analysis |
| Leaflet shows | Dimension grades + stage sub-grades + trajectory charts | Dimension grades + core metric values |
| Mockup | ADM mockup | LLM mockup |
Both paths assess the same five dimensions: Fairness, Reliability, Privacy & Confidentiality, Security & Misuse, Governance.
All checks produce scores on the same 0–5 severity scale. They differ in how the score is produced:
| Score type | How it's produced | Example |
|---|---|---|
| Metric-based | Computed from tests or automated measurements | Factual accuracy rate, demographic parity ratio |
| Evidence-based | Auditor verifies countable facts | % of required documentation fields present |
| Judgment-based | Auditor evaluates quality or appropriateness | Is the incident response plan adequate for the risk level? |
Over time, judgment-based checks can become evidence-based (by defining what to count), and evidence-based checks can become metric-based (by automating the measurement).
For metric-based scores, the conversion to the 0–5 scale follows this pipeline:
| Magnitude of deviation from threshold | Severity |
|---|---|
| Threshold cleared with comfortable margin (≥ 1 practical-significance unit, or registry-defined "comfortable" band) | 0–1 |
| Within the practical-significance margin of the threshold (either side) | 2–3 |
| Threshold breached by more than the practical-significance margin | 4–5 |
The "practical-significance margin" is a per-metric quantity declared in the metrics registry. For fairness scores this is the ≥ 0.5-point threshold from the project's fairness methodology stance; for rate-based metrics it is typically defined as a percentage-point band around the threshold. The registry holds the concrete numbers per metric so the band descriptions above remain comparable across audits.
Once converted, all three score types sit on the same 0–5 scale and feed into the same aggregation rules.
Two signals are derived from the check scores within a dimension, then combined to produce the dimension grade:
The single highest check score within the dimension. Flags whether any one check represents a catastrophic failure — a single severe finding can drag the grade down regardless of how the rest behaves.
How many checks within the dimension reach concerning levels (counted via majority thresholds: "majority ≥ N"). Flags systemic degradation across the dimension as opposed to a single isolated failure.
Note on Breadth of concern. The count-based formulation was chosen deliberately to prevent a single severe finding being diluted by good evidence elsewhere in the dimension. A continuous formulation (e.g. weighted mean of check scores) would change that property and would require recalibrating thresholds. Whether to revisit the formulation is captured as a deferred item below; the position taken in v5 is to keep count-based until real-data evidence shows it produces grades inconsistent with qualitative reading.
The rules are applied to checks in sequence — the first matching rule determines the grade. The table below is in application order: most-severe evidence pattern first, default last.
| Rule | Peak signal | Breadth signal | Grade |
|---|---|---|---|
critical_and_broad | Any score of 5 | AND majority (≥ half) of checks ≥ 3 | E |
critical_or_high_majority | Any score of 5 | OR majority of checks ≥ 4 | D |
high_or_medium_majority | Any score of 4 | OR majority of checks ≥ 3 | C |
mostly_clean | Max severity ≤ 2 | AND fewer than half of checks ≥ 2 | A |
limited_concerns | Default — max severity ≤ 3 | AND fewer than half of checks ≥ 3 | B |
Rule names describe the evidence pattern each rule detects, decoupled from the grade letter — so the rule set survives a future change of grade scale without re-naming. Note that mostly_clean is listed before limited_concerns even though it produces a better grade: the ordering reflects application sequence, not output grade. The default's looser condition would otherwise eat any input that satisfies the strict mostly_clean condition.
mostly_clean rule requires Peak severity ≤ 2 — by construction, an A grade is incompatible with any check at severity 4 or 5 in the dimension. A single severe finding precludes A regardless of how clean the other checks are.
For ADM (multi-stage): rules applied per stage, then stage grades aggregated: mean rounded toward worse grade (ceiling), floor rule (any stage D or E → overall can't be better than C).
For LLM (Post only): rules applied directly to Post-Processing checks to produce dimension grades.
The five-grade scale is the public-facing summary of the rules above.
See ADM leaflet mockup.
Without this context, the same severity score can mean very different things — a 94% accuracy on 50 prompts is weaker evidence than 94% on 5,000 prompts, and a passing benchmark grade against a saturated benchmark is closer to no information than to a positive signal.
Before audit begins, the auditor must declare the deployment context tier. This determines which threshold adjustments apply and must appear on the leaflet face.
| Tier | Description | Threshold adjustment |
|---|---|---|
| High-stakes | Medical, legal, hiring, financial, children's services | Stricter thresholds apply — see threshold table v1.0 high-stakes column (to be published) |
| General consumer | Public-facing information or assistance tools | Standard thresholds apply |
| Internal / low-risk | Internal tooling, low-consequence outputs | Relaxed thresholds may apply — auditor must justify |
Fairness (2)
| Metric | Criterion | What it measures |
|---|---|---|
| Stereotype association | Parity | Whether the model systematically associates attributes with groups (e.g., occupations with gender). Measured via factorial vignette testing. |
| Demographic parity | Representativeness | Whether output rates are proportional across groups. |
Reliability (3)
| Metric | Criterion | What it measures |
|---|---|---|
| Factual accuracy | Correctness | Whether outputs are factually grounded and free of fabricated information. |
| Manipulation rate | Correctness | Whether outputs unduly persuade, coerce, or deceive users. Distinct from accuracy — a system can be factually correct and still manipulative. |
| Prompt sensitivity | Stability | Whether minor phrasing changes produce substantially different outputs. |
All metrics use the same pipeline: design test cases → run → measure → compare to threshold → severity score.
| Pattern | What it computes | Example |
|---|---|---|
| Rate | % of test cases pass/fail | Factual accuracy: 94% |
| Ratio | Compare a rate across groups | Demographic parity: 0.85 |
| Variance | Change under perturbation | Prompt sensitivity: 8% deviation |
See LLM leaflet mockup.
| Metric | What was done | What was found |
|---|---|---|
| Stereotype association | Sentiment analysis by group (Mann-Whitney U), qualitative review of outputs | No significant disparities in sentiment across gender, age, location. "Essential trait" language detected — subtly implying certain traits are universally required for careers. |
| Demographic parity | Sentiment equity across groups as proxy | Sentiment equitable. But actual career recommendations were not disaggregated by group — tone equity was assessed, not outcome equity. |
| Metric | What to do | What it would reveal |
|---|---|---|
| Stereotype association | Factorial vignette testing: identical student profiles differing only in gender/race. Run against deployed system. | Whether the system steers girls toward nursing or boys toward engineering. Whether "essential trait" language appears more for certain demographics. |
| Demographic parity | Measure career recommendation overlap, breadth, and diversity across factorial conditions. Disaggregate production recommendations by group. | Whether girls receive a narrower range of career recommendations than boys with identical interests. Whether actual careers recommended (not just tone) differ by group. |
As LLMs increasingly replace ADMs for decision-like tasks (recommending, classifying, filtering), lifecycle tracking becomes more relevant. The extension would enable the "where did the problem enter?" analysis for LLM systems.
| Layer | Local model (e.g. Llama) | API-only (e.g. GPT-4o) | No model access |
|---|---|---|---|
| Base model | Run same tests as Post against raw model. Clean comparison. | Run same tests via API — results may reflect hidden provider interventions. Note limitation. | Use provider model card if available. |
| Fine-tuning data | Review prompt-response pairs for stereotypical patterns, manipulative language, factual errors | Not available | |
| RAG corpus | Review content for accuracy, representativeness, stereotypical associations | Not available | |
Because Career Scoops uses Llama (open weights, locally hosted), this is the best-case scenario for Pre→Post comparison — identical tests can be run against the base model and the deployed system.
| Stage | What to test | What it would reveal |
|---|---|---|
| Pre: base model | Run factorial vignettes against raw Llama without RAG or prompts | Whether Llama already carries occupational stereotypes before Career Scoops adds anything |
| Pre: RAG corpus | Analyse BLS/O*NET data for demographic coverage and stereotypical patterns | Whether the "essential trait" language originates in the data |
| Configuration | Review system prompts and retrieval logic | Whether retrieval returns different content based on implicit demographic signals |
| Pre→Post comparison | Compare factorial test results: base model vs deployed system | Whether the deployment configuration (RAG + prompts) corrects or amplifies base model stereotypes |
Items that surfaced in the v4 review or in v5 discussion but were not resolved in this iteration. Each is a candidate for the next round.
| Item | Source | Why deferred |
|---|---|---|
| Rename "Post-Processing" to "post-training" throughout | Usman, 2026-03-15 | "Post-Processing" in this framework names the lifecycle stage being audited (the deployed system, common to ADM and LLM paths), not the model-training stage. Renaming would conflate two different concepts and break the parallel with ADM. The right terminology fix — if any — should clarify the stage-vs-training distinction in the prose, not change the section labels. Pending discussion. |
| Who creates the per-dimension category summary on the leaflet face? | Usman, 2026-03-15 | Already addressed structurally: §2 "Key mechanism" specifies that all leaflet content (per-category summaries included) comes from tagged fields in the audit report — the leaflet does not generate content, it composes pre-tagged content. The downstream question (how the audit report itself is structured to produce these tagged fields — form? template? guidelines?) is an audit-methodology concern, out of scope for the leaflet rationale. v5 adds a clarifying note in §2 to make this explicit. |
| Define representative test methods for LLM evaluation (lit review) | Usman, 2026-03-15 | Substantial work: a literature review of factorial-vignette designs, benchmark suites, and stochastic-evaluation protocols, with the explicit caveat that the field moves fast and the methodology must allow for swap-out. Likely lives in the audit-methodology repo, not in this rationale. Out of scope for a single-session iteration; tracked as its own workstream. |
| Continuous (mean-based) vs. count-based formulation of the breadth signal | Implicit in Usman's "weighted mean" framing, 2026-05-08 | The current rules use majority counts. The count-based form was chosen deliberately to prevent a single severe finding being diluted by good evidence elsewhere in the dimension — example: [5,1,1,1] gives mean = 1.4 but majority ≥ 3 = 0 of 4 (a single severe finding diluted under mean); [5,3,3,1] gives mean = 3.0 but majority ≥ 3 = 2 of 4 (count-based requires explicit breadth alongside the peak). Switching to a continuous form would lose this property and would require recalibrating thresholds. Decision deferred until real-data evidence (Wiselook) shows the count-based form produces grades inconsistent with qualitative reading. |