Assessment results
Bias & Fairness
C
Score disparity by Spanish dialect is below the pre-declared 0.5-point practical-significance threshold (maximum gap 0.4 points, on Pensamiento Critico). Gender-expression disparity exceeds the threshold on Aprendizaje Continuo (0.6-point gap, p = 0.008). Qualitative coding of scoring rationales finds that references to linguistic style appear more often in lower-scoring conditions, suggesting that style is being treated as evidence; the auditor flags this pattern for follow-up.
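The threshold comparison described above can be sketched as follows. This is a minimal illustration, assuming the gap is the difference between the highest and lowest group mean on a competency; the function names and the example scores are hypothetical, and the significance test behind the reported p-value is not reproduced here.

```python
from statistics import mean

# Pre-declared practical-significance threshold from the audit (score points).
THRESHOLD = 0.5

def max_group_gap(scores_by_group):
    """Return the largest difference between any two group mean scores."""
    means = [mean(scores) for scores in scores_by_group.values()]
    return max(means) - min(means)

def exceeds_threshold(scores_by_group, threshold=THRESHOLD):
    """True when the largest gap in group means is strictly above the threshold."""
    return max_group_gap(scores_by_group) > threshold

# Hypothetical scores for one competency, keyed by condition (illustrative only).
example_scores = {
    "condition_a": [3.0, 3.5, 4.0],
    "condition_b": [3.0, 3.0, 3.5],
}
gap = max_group_gap(example_scores)
print(f"max gap = {gap:.2f}, flagged = {exceeds_threshold(example_scores)}")
```

A gap above the threshold marks a disparity as practically significant; whether it is also statistically significant is a separate question that this sketch does not address.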
Reliability
B
Test-retest reliability is high (ICC = 0.91 across 10 identical-input runs). Paraphrase stability is borderline: mean score divergence under semantics-preserving paraphrasing is 12%, at the upper end of the acceptable range.
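For readers unfamiliar with the statistic, the sketch below computes one common form of the intraclass correlation, ICC(1,1) from a one-way random-effects ANOVA. The report does not state which ICC form the auditor used, so this choice is an assumption, and the function name and example ratings are hypothetical.

```python
from statistics import mean

def icc_one_way(ratings):
    """ICC(1,1) from a one-way random-effects ANOVA (assumed ICC form).

    ratings: list of rows, one row per input, each row holding the k scores
    produced by k repeated runs on that identical input.
    """
    n = len(ratings)      # number of distinct inputs
    k = len(ratings[0])   # repeated runs per input
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    # Between-input and within-input mean squares.
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical scores: three inputs, each scored in three repeated runs.
example_ratings = [
    [3.0, 3.1, 2.9],
    [4.0, 4.2, 4.1],
    [2.0, 2.1, 2.0],
]
print(f"ICC(1,1) = {icc_one_way(example_ratings):.3f}")
```

Values near 1 mean that almost all score variance comes from differences between inputs rather than from run-to-run noise on the same input.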
Privacy & Confidentiality
B
No personal data leakage was detected in candidate-facing reports. A small number of conversation logs surface partial first names in the verbatim quotes used in the rationale; this is consistent with candidates self-disclosing in their responses, not with system-side retention or memorisation. Cross-session isolation was tested on a small sample, and no clear leak was detected.
Security & Misuse
B
Prompt-injection robustness was tested with 40 vignettes embedded in candidate responses; 92% were safely refused or redirected. Two cases (5%) produced partial leakage of rubric structure in the rationale text. No successful score manipulation was observed.
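Outcome rates of this kind can be tallied from per-vignette labels, as in the minimal sketch below. The outcome labels, the function name, and the example data are hypothetical and do not reproduce the audit's own coding scheme or results.

```python
from collections import Counter

def outcome_rates(outcomes):
    """Map each outcome label to its share of all vignettes, in percent."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return {label: 100 * count / total for label, count in counts.items()}

# Hypothetical labels for ten vignettes (illustrative only).
example_outcomes = ["refused"] * 8 + ["redirected"] + ["rubric_leak"]
rates = outcome_rates(example_outcomes)
for label, pct in rates.items():
    print(f"{label}: {pct:.1f}%")
```

Reporting the raw count alongside each percentage makes small-sample results easier to check.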
Governance
C
Methodology, system architecture, and prompt structure are documented at a high level. Specific scoring rubric content (per competency, per maturity level) is not formally documented in client-accessible form. Prompt updates are tracked in version control but lack a formal pre-deployment validation step or rollback policy.