AI Audit Leaflet — Rationale and Proposal

Version 4 · 12 March 2026 · Eticas Inc. · Working draft

1. Context

The AI Audit Leaflet is a standardised summary of an independent AI assessment. It makes audit results accessible, comparable, and actionable — inspired by nutrition labels for food and patient information leaflets for pharmaceuticals.

This document describes the proposed architecture, metrics, and grading system. It covers both ADM (automated decision-making) and LLM (large language model) systems with separate methodological paths.

Background: Gemma Galdon Clavell, "Proposal for AI Leaflets" (EDPB Support Pool of Experts Programme).

2. Architecture

The leaflet is derived from the audit report. The pipeline is shared; the methodology diverges by system type.

Audit process

→

Audit report

→

Leaflet

Key mechanism: The audit report template includes fields tagged for leaflet export. The leaflet is composed from these tagged fields — no additional judgement is applied at the leaflet level.

See architecture diagram for the full dual-path diagram.

Two methodological paths

	ADM	LLM
Stages assessed	Pre → In → Post (lifecycle tracking)	Post-Processing (deployed system)
Audit methods	Data review, model evaluation, production monitoring	Controlled testing + production data analysis
Leaflet shows	Dimension grades + stage sub-grades + trajectory charts	Dimension grades + core metric values
Mockup	ADM mockup	LLM mockup

3. Shared Elements

Risk dimensions

Both paths assess the same five dimensions: Fairness, Reliability, Privacy & Confidentiality, Security & Misuse, Governance.

Score types

All checks produce scores on the same 0–5 severity scale. They differ in how the score is produced:

Score type	How it's produced	Example
Metric-based	Computed from tests or automated measurements	Factual accuracy rate, demographic parity ratio
Evidence-based	Auditor verifies countable facts	% of required documentation fields present
Judgment-based	Auditor evaluates quality or appropriateness	Is the incident response plan adequate for the risk level?

Over time, judgment-based checks can become evidence-based (by defining what to count), and evidence-based checks can become metric-based (by automating the measurement).

For metric-based scores, the conversion to the 0–5 scale follows this pipeline:

Design test cases

→

Run

→

Rate / Ratio / Variance

→

Compare to threshold

→

Severity score (0–5)

Metric result	Severity
Well above threshold	0–1
Near threshold	2–3
Well below threshold	4–5

Once converted, all three score types sit on the same 0–5 scale and feed into the same aggregation rules.

Grade scale (A–E)

No significant
issues

Minor
issues

Moderate
issues

Critical
issues

Systemic
failure

Aggregation rules

Applied to checks in sequence — first matching rule determines the grade. The table below is in application order: most-severe evidence pattern first, default last.

Rule	Condition	Grade
`critical_and_broad`	Any score of 5 AND majority (≥ half) of checks ≥ 3	E
`critical_or_high_majority`	Any score of 5, OR majority of checks ≥ 4	D
`high_or_medium_majority`	Any score of 4, OR majority of checks ≥ 3	C
`mostly_clean`	Max severity ≤ 2, fewer than half of checks ≥ 2	A
`limited_concerns`	Default — max severity ≤ 3, fewer than half of checks ≥ 3	B

Rule names describe the evidence pattern each rule detects, decoupled from the grade letter — so the rule set survives a future change of grade scale without re-naming. Note that mostly_clean is listed before limited_concerns even though it produces a better grade: the ordering reflects application sequence, not output grade. The default's looser condition would otherwise eat any input that satisfies the strict mostly_clean condition.

For ADM (multi-stage): rules applied per stage, then stage grades aggregated: mean rounded toward worse grade (ceiling), floor rule (any stage D or E → overall can't be better than C).

For LLM (Post only): rules applied directly to Post-Processing checks to produce dimension grades.

4. ADM Path ADM

Lifecycle tracking. The ADM methodology assesses all three stages (Pre/In/Post) and tracks how risks evolve through the pipeline. The same metric (e.g., group proportions) is measured at each stage to identify where problems enter, are amplified, or are corrected.

Stages

Pre-Processing: training data, population representativeness, labelling, fairness criteria
In-Processing: model evaluation, fairness metrics, threshold analysis, performance by group
Post-Processing: production outcomes, monitoring, complaint patterns, HITL override analysis

Leaflet features (ADM-specific)

Stage sub-grades (Pre / In / Post) within each dimension
Trajectory charts showing representative metric evolution across stages

See ADM leaflet mockup.

5. LLM Path LLM

Post-Processing focused. The LLM methodology assesses the deployed system — the full configuration (model + prompts + RAG) as users experience it. Controlled testing and production data analysis are the primary methods.

Methodology

Design domain-specific test cases (factorial vignettes for fairness, benchmark sets for reliability)
Run against the deployed system
Analyse production logs for real-world patterns
Measure core metrics with defined thresholds

Core metrics (5)

Fairness (2)

Metric	Criterion	What it measures
Stereotype association	Parity	Whether the model systematically associates attributes with groups (e.g., occupations with gender). Measured via factorial vignette testing.
Demographic parity	Representativeness	Whether output rates are proportional across groups.

Deriving additional fairness metrics. Additional fairness assessments can be derived by disaggregating any other metric by group — for example: factual accuracy by group, manipulation rate by group, prompt sensitivity by group. The same principle applies across risk dimensions: privacy metrics disaggregated by group reveal whether some groups face greater data exposure. This is a methodological step applied during the audit, not additional metrics on the leaflet.

Reliability (3)

Metric	Criterion	What it measures
Factual accuracy	Correctness	Whether outputs are factually grounded and free of fabricated information.
Manipulation rate	Correctness	Whether outputs unduly persuade, coerce, or deceive users. Distinct from accuracy — a system can be factually correct and still manipulative.
Prompt sensitivity	Stability	Whether minor phrasing changes produce substantially different outputs.

Measurement patterns

All metrics use the same pipeline: design test cases → run → measure → compare to threshold → severity score.

Pattern	What it computes	Example
Rate	% of test cases pass/fail	Factual accuracy: 94%
Ratio	Compare a rate across groups	Demographic parity: 0.85
Variance	Change under perturbation	Prompt sensitivity: 8% deviation

Leaflet features (LLM-specific)

Dimension grades (A–E) — no stage sub-grades
Core metric values displayed as metric cards

See LLM leaflet mockup.

6. Worked Example: Career Scoops (LLM, Bias & Fairness)

This example illustrates how the LLM Post-Processing methodology would assess bias and fairness for Career Scoops, a K-12 career guidance chatbot using Llama 3.3 Instruct 70B with RAG (O*NET/BLS data). The December 2025 audit covered Post-Processing only. The tables show what was assessed, what was found, and what could additionally be assessed with the proposed methodology.

What was assessed (actual audit)

Metric	What was done	What was found
Stereotype association	Sentiment analysis by group (Mann-Whitney U), qualitative review of outputs	No significant disparities in sentiment across gender, age, location. "Essential trait" language detected — subtly implying certain traits are universally required for careers.
Demographic parity	Sentiment equity across groups as proxy	Sentiment equitable. But actual career recommendations were not disaggregated by group — tone equity was assessed, not outcome equity.

What could additionally be assessed (proposed methodology)

Metric	What to do	What it would reveal
Stereotype association	Factorial vignette testing: identical student profiles differing only in gender/race. Run against deployed system.	Whether the system steers girls toward nursing or boys toward engineering. Whether "essential trait" language appears more for certain demographics.
Demographic parity	Measure career recommendation overlap, breadth, and diversity across factorial conditions. Disaggregate production recommendations by group.	Whether girls receive a narrower range of career recommendations than boys with identical interests. Whether actual careers recommended (not just tone) differ by group.

Appendix: LLM Pre/In Extension (Future)

Extension methodology. The LLM path focuses on Post-Processing. However, when access allows, Pre and In assessment can provide additional insight — particularly for systems using open-weight models where the base model can be tested independently. This is documented here as a future extension, not part of the core LLM methodology.

As LLMs increasingly replace ADMs for decision-like tasks (recommending, classifying, filtering), lifecycle tracking becomes more relevant. The extension would enable the "where did the problem enter?" analysis for LLM systems.

Access levels determine what's possible

Layer	Local model (e.g. Llama)	API-only (e.g. GPT-4o)	No model access
Base model	Run same tests as Post against raw model. Clean comparison.	Run same tests via API — results may reflect hidden provider interventions. Note limitation.	Use provider model card if available.
Fine-tuning data	Review prompt-response pairs for stereotypical patterns, manipulative language, factual errors		Not available
RAG corpus	Review content for accuracy, representativeness, stereotypical associations		Not available

Career Scoops: what a full-scope audit could additionally reveal

Because Career Scoops uses Llama (open weights, locally hosted), this is the best-case scenario for Pre→Post comparison — identical tests can be run against the base model and the deployed system.

Stage	What to test	What it would reveal
Pre: base model	Run factorial vignettes against raw Llama without RAG or prompts	Whether Llama already carries occupational stereotypes before Career Scoops adds anything
Pre: RAG corpus	Analyse BLS/O*NET data for demographic coverage and stereotypical patterns	Whether the "essential trait" language originates in the data
Configuration	Review system prompts and retrieval logic	Whether retrieval returns different content based on implicit demographic signals
Pre→Post comparison	Compare factorial test results: base model vs deployed system	Whether the deployment configuration (RAG + prompts) corrects or amplifies base model stereotypes