AI Audit Leaflet

Assessment results

Bias & Fairness: Grade C

Score disparity by Spanish dialect is below the pre-declared 0.5-point practical-significance threshold (max gap 0.4 points on Pensamiento Critico). Gender-expression disparity exceeds the threshold on Aprendizaje Continuo (0.6-point gap, p=0.008). Qualitative coding of scoring rationales finds linguistic-style references appearing more often in lower-scoring conditions, indicating style-as-evidence patterns the auditor flags for follow-up.
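The practical-significance check described above can be sketched as follows. This is a minimal illustration, not the auditor's actual procedure: the function names, the group labels, and the sample scores are hypothetical, and only the 0.5-point threshold comes from the leaflet. The sketch covers the practical-significance comparison only; the statistical test behind the reported p-value is not reproduced here.

```python
from statistics import mean

# Pre-declared practical-significance threshold from the leaflet, in score points.
PRACTICAL_THRESHOLD = 0.5


def max_group_gap(scores_by_group):
    """Return the largest difference between any two group mean scores."""
    means = [mean(scores) for scores in scores_by_group.values()]
    return max(means) - min(means)


def exceeds_threshold(scores_by_group, threshold=PRACTICAL_THRESHOLD):
    """True when the largest group gap is above the practical threshold."""
    return max_group_gap(scores_by_group) > threshold


# Hypothetical scores on the 1-5 scale for two demographic conditions.
example = {
    "condition_a": [3.0, 4.0, 4.0, 3.0],  # mean 3.5
    "condition_b": [3.0, 3.0, 3.0, 2.0],  # mean 2.75
}

print(max_group_gap(example), exceeds_threshold(example))  # 0.75 True
```

Under this reading, the 0.4-point dialect gap would pass the check and the 0.6-point gender-expression gap would not.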

Reliability: Grade B

Test-retest reliability is high (ICC = 0.91 across 10 identical-input runs). Paraphrase stability is at the upper end of acceptable bounds (12% mean score divergence under semantic-preserving paraphrasing).
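The leaflet does not state how the paraphrase divergence percentage is computed. One plausible reading, shown here purely as an assumption, is the mean absolute score change between an original response and its paraphrase, expressed as a fraction of the 1-5 scale range. The function name and the sample scores are hypothetical.

```python
def mean_score_divergence(original, paraphrased, scale_min=1.0, scale_max=5.0):
    """Mean absolute score change, as a fraction of the scale range.

    Assumed definition: the leaflet does not specify the actual formula.
    """
    if not original or len(original) != len(paraphrased):
        raise ValueError("score lists must be non-empty and of equal length")
    span = scale_max - scale_min
    diffs = [abs(o - p) for o, p in zip(original, paraphrased)]
    return sum(diffs) / len(diffs) / span


# Hypothetical paired scores: each original response and its paraphrase.
original_scores = [4.0, 3.0, 5.0, 2.0]
paraphrased_scores = [4.0, 3.5, 4.0, 2.0]

divergence = mean_score_divergence(original_scores, paraphrased_scores)
print(f"{divergence:.1%}")
```

Other definitions (for example, divergence relative to the original score rather than the scale range) would give different percentages, so the figure should be read against the full technical report.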

Privacy & Confidentiality: Grade B

No personal data leakage was detected in candidate-facing reports. A small number of conversation logs surface partial first names in the verbatim quotes used by the rationale; this is consistent with candidates self-disclosing in their responses, not with system-side retention or memorisation. Cross-session isolation was tested on a small sample, with no clear leak detected.

Security & Misuse: Grade B

Prompt-injection robustness was tested with 40 vignettes embedded in candidate responses; the system safely refused or redirected 92% of them. Two cases (5%) produced partial leakage of rubric structure in the rationale text. No successful score manipulation was observed.

Governance: Grade C

Methodology, system architecture, and prompt structure are documented at a high level. Specific scoring rubric content (per competency, per maturity level) is not formally documented in client-accessible form. Prompt updates are tracked in version control but lack a formal pre-deployment validation step or rollback policy.

Core metrics

Fairness · Stereotype association · 0.78 · Parity score across demographic conditions (1.0 = full parity)

Fairness · Demographic parity · 0.83 · Lowest- to highest-group mean score ratio (raw 1-5 scale)

Reliability · Rationale-quote fidelity · 96% · Verbatim quotes in rationales match candidate text

Reliability · Manipulation resistance · 92% · Resistance to gaming and score-manipulation attempts

Reliability · Prompt sensitivity · 12% · Score divergence under semantic-preserving paraphrasing
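The demographic parity metric listed above is described as the ratio of the lowest group mean score to the highest. A minimal sketch of that calculation follows; the function name, group labels, and scores are hypothetical and do not come from the audit data.

```python
from statistics import mean


def parity_ratio(scores_by_group):
    """Ratio of the lowest group mean to the highest (1.0 = full parity)."""
    means = [mean(scores) for scores in scores_by_group.values()]
    return min(means) / max(means)


# Hypothetical raw scores on the 1-5 scale for two groups.
example_groups = {
    "group_a": [4.0, 4.0, 4.0, 4.0],  # mean 4.0
    "group_b": [3.0, 3.0, 3.0, 4.0],  # mean 3.25
}

print(parity_ratio(example_groups))  # 0.8125
```

Because the ratio uses raw scores on a scale that starts at 1 rather than 0, the same absolute gap produces a ratio closer to 1.0 when group means are high than when they are low.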

How to read this leaflet. Each risk dimension is graded A (best) to E (worst) based on an independent audit of the deployed system. Core metrics show measured performance on key fairness and reliability indicators. Grades are derived mechanically from individual audit checks. The full technical report with detailed findings is available from the auditor.

Grade scale: A No significant issues · B Minor issues · C Moderate issues · D Critical issues · E Systemic failure

Audit details

Client