AI Audit Leaflet

Independent AI Assessment
System: Wiselook
Version: 0.1-synthetic
Type: LLM
Domain: HR / Talent assessment
Owner: Wiselook S.L.
Risk level: High
Assessment results
Bias & Fairness: C
Score disparity by Spanish dialect is below the pre-declared 0.5-point practical-significance threshold (max gap 0.4 points, on Pensamiento Critico). Gender-expression disparity exceeds the threshold on Aprendizaje Continuo (0.6-point gap, p=0.008). Qualitative coding of scoring rationales finds that references to linguistic style appear more often in lower-scoring conditions. This suggests a style-as-evidence pattern, which the auditor flags for follow-up.
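The practical-significance check described above can be sketched as follows. This is a minimal illustration, not the auditor's tooling: the function names and the example group means are hypothetical, and only the 0.5-point threshold comes from the leaflet. Statistical significance (the reported p-value) is a separate test and is not covered here.

```python
# Pre-declared practical-significance threshold (points on the 1-5 scale),
# as stated in the leaflet. Everything else below is an illustrative assumption.
PRACTICAL_THRESHOLD = 0.5


def max_gap(group_means):
    """Largest difference between any two group mean scores."""
    values = list(group_means.values())
    return max(values) - min(values)


def exceeds_threshold(group_means, threshold=PRACTICAL_THRESHOLD):
    """True when the largest between-group gap is strictly above the threshold."""
    return max_gap(group_means) > threshold


# Hypothetical group means chosen only to reproduce gaps of 0.4 and 0.6 points.
dialect_means = {"dialect_a": 3.6, "dialect_b": 3.2}
gender_expression_means = {"condition_x": 3.9, "condition_y": 3.3}

print(exceeds_threshold(dialect_means))            # gap of about 0.4, below 0.5
print(exceeds_threshold(gender_expression_means))  # gap of about 0.6, above 0.5
```

Declaring the threshold before looking at the data, as the audit does, keeps the pass/fail decision from being tuned to the observed gaps.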
Reliability: B
Test-retest reliability is high (ICC = 0.91 across 10 identical-input runs). Paraphrase sensitivity is near the upper limit of the acceptable range (12% mean score divergence under semantic-preserving paraphrasing).
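For readers unfamiliar with the intraclass correlation coefficient, the sketch below shows one common form, the one-way random-effects ICC(1,1). The leaflet does not say which ICC variant the auditor used, so this choice is an assumption, and the function name and sample scores are hypothetical.

```python
from statistics import mean


def icc_oneway(scores):
    """One-way random-effects ICC(1,1).

    `scores` is a list of rows; each row holds the k repeated scores
    that the system gave to one identical input.
    """
    n = len(scores)
    k = len(scores[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need at least 2 inputs, each with the same number (>= 2) of runs")

    row_means = [mean(row) for row in scores]
    grand_mean = mean(row_means)  # equals the overall mean because the design is balanced

    # Between-input and within-input mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(scores, row_means) for x in row
    ) / (n * (k - 1))

    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("scores have no variance; ICC is undefined")
    return (ms_between - ms_within) / denominator


# Hypothetical data: three inputs, two runs each.
print(icc_oneway([[1, 2], [4, 5], [2, 3]]))
```

An ICC close to 1 means that almost all score variance comes from differences between inputs rather than from run-to-run noise on the same input.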
Privacy & Confidentiality: B
No personal data leakage was detected in candidate-facing reports. A small number of conversation logs surface partial first names in the verbatim quotes used by the rationale; this is consistent with candidates self-disclosing in their responses, not with system-side retention or memorisation. Cross-session isolation was tested on a small sample, and no clear leak was detected.
Security & Misuse: B
Prompt-injection robustness was tested with 40 vignettes embedded in candidate responses; 92% were safely refused or redirected. Two cases (5%) produced partial leakage of rubric structure in the rationale text. No successful score manipulation was observed.
Governance: C
Methodology, system architecture, and prompt structure are documented at a high level. Specific scoring rubric content (per competency, per maturity level) is not formally documented in client-accessible form. Prompt updates are tracked in version control, but there is no formal pre-deployment validation step and no rollback policy.
Core metrics
Category · Metric · Value · Description
Fairness · Stereotype association · 0.78 · Parity score across demographic conditions (1.0 = full parity)
Fairness · Demographic parity · 0.83 · Lowest- to highest-group mean score ratio (raw 1-5 scale)
Reliability · Rationale-quote fidelity · 96% · Verbatim quotes in rationales match candidate text
Reliability · Manipulation resistance · 92% · Resistance to gaming and score-manipulation attempts
Reliability · Prompt sensitivity · 12% · Score divergence under semantic-preserving paraphrasing
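As a rough guide to how three of these metrics could be computed, the sketch below implements the demographic parity ratio, rationale-quote fidelity, and prompt sensitivity. All function names and sample inputs are hypothetical. The leaflet does not define how score divergence is normalised, so the definition used here (mean absolute change relative to the original score) is an assumption, as is the whitespace normalisation in the quote check.

```python
from statistics import mean


def parity_ratio(group_means):
    """Lowest group mean divided by highest group mean (1.0 = full parity)."""
    values = list(group_means.values())
    return min(values) / max(values)


def _normalise(text):
    """Collapse whitespace runs so line wrapping does not count as a mismatch."""
    return " ".join(text.split())


def quote_fidelity(quotes, candidate_text):
    """Share of rationale quotes that appear verbatim in the candidate's text."""
    if not quotes:
        raise ValueError("no quotes to check")
    haystack = _normalise(candidate_text)
    hits = sum(1 for quote in quotes if _normalise(quote) in haystack)
    return hits / len(quotes)


def mean_divergence(original, paraphrased):
    """Mean absolute score change relative to the original score.

    Assumed definition; the leaflet only reports the resulting percentage.
    """
    if not original or len(original) != len(paraphrased):
        raise ValueError("score lists must be non-empty and of equal length")
    return mean(abs(p - o) / o for o, p in zip(original, paraphrased))


# Hypothetical inputs, for illustration only.
print(parity_ratio({"group_a": 3.0, "group_b": 4.0}))
print(quote_fidelity(["I led  the team", "never said this"],
                     "I led the\nteam for two years"))
print(mean_divergence([4.0, 2.0], [3.0, 2.0]))
```

On these inputs the three calls give 0.75, 0.5, and 0.125 respectively; the sample values are not taken from the audit.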
How to read this leaflet. Each risk dimension is graded A (best) to E (worst) based on an independent audit of the deployed system. Core metrics show measured performance on key fairness and reliability indicators. Grades are derived mechanically from individual audit checks. The full technical report with detailed findings is available from the auditor.

Grade scale: A No significant issues · B Minor issues · C Moderate issues · D Critical issues · E Systemic failure