Standard setting and passing scores for high-stakes exams in the health professions
Abstract
Introduction: High-stakes summative examinations in the health professions underpin certification and licensure decisions. Cut scores must be valid, transparent, and defensible, balancing patient safety, fairness, and feasibility. These needs are especially pressing across Latin America.
Objective: To synthesize conceptual frameworks, methods, and practical considerations for standard setting and cut-score determination in certification exams, including psychometrics, legal defensibility, and consequences analysis, with a regional emphasis.
Method: Narrative review of international and regional literature, technical reports, and guidelines from certifying bodies. Approaches for knowledge tests (e.g., multiple-choice) and performance assessments (e.g., simulations) are compared, highlighting requirements, strengths, and limitations.
Results: Judge-based methods (modified Angoff, Ebel) are well suited when content is tightly curriculum-aligned; Bookmark leverages item response theory (IRT) to order items and set thresholds; Hofstee constrains the cut score to ranges of acceptable pass marks and failure rates; Beuk and hybrid approaches reconcile the judged standard with empirical difficulty. For performance assessments, Borderline Group and Borderline Regression predominate. Critical factors include panel selection and training; use of empirical evidence (difficulty, discrimination, differential item functioning); equating across forms; estimating the standard error of measurement and confidence bands for pass/fail decisions; documentation and transparency; and monitoring subgroup impact for equity. Legal defensibility improves when standards are linked to competency descriptors, supported by an explicit validity argument, and accompanied by classification accuracy and consistency evidence.
Implementation in Latin America can benefit from faculty development and capacity building, method selection aligned to data availability (e.g., Angoff/Ebel when IRT calibration is not feasible; Bookmark when item banks exist), robust governance, and routine consequences analyses to ensure fairness and public trust.
Conclusions: No single method is universally superior. Hybrid processes, explicit validity arguments, clear governance, and ongoing psychometric and consequences monitoring strengthen defensibility and equity.
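Illustrative sketch: To make three of the calculations named in the Results concrete, the following Python sketch computes a modified Angoff cut score, a Borderline Regression cut score, and a classical standard error of measurement (SEM). It is a minimal illustration under stated assumptions, not a procedure taken from any certifying body. The function names, the rating scale (1 = fail, 2 = borderline, 3 = pass), and all input values are hypothetical.

```python
from math import sqrt
from statistics import mean


def angoff_cut_score(ratings):
    """Modified Angoff: ratings[j][i] is judge j's estimated probability that a
    minimally competent candidate answers item i correctly. Each judge's
    ratings are summed, and the judges' sums are averaged (raw-score units)."""
    return mean(sum(judge_ratings) for judge_ratings in ratings)


def borderline_regression_cut(global_ratings, checklist_scores, borderline=2):
    """Borderline Regression: regress station checklist scores on examiners'
    global ratings (ordinary least squares) and return the predicted checklist
    score at the 'borderline' rating. The scale value 2 is an assumption."""
    mx, my = mean(global_ratings), mean(checklist_scores)
    sxx = sum((x - mx) ** 2 for x in global_ratings)
    sxy = sum((x - mx) * (y - my)
              for x, y in zip(global_ratings, checklist_scores))
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept + slope * borderline


def standard_error_of_measurement(score_sd, reliability):
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    return score_sd * sqrt(1.0 - reliability)


# Hypothetical data: two judges rating three items.
angoff = angoff_cut_score([[0.5, 0.5, 1.0], [0.5, 0.5, 0.0]])

# Hypothetical station data: global ratings and checklist scores.
blr = borderline_regression_cut([1, 2, 3], [10, 20, 30])

# Hypothetical score SD and reliability; one SEM either side of the cut
# gives a simple band within which pass/fail decisions are least certain.
sem = standard_error_of_measurement(10.0, 0.84)
band = (angoff - sem, angoff + sem)
```

In practice the inputs would come from a trained panel and from operational exam data, and the band would typically feed a documented policy for candidates scoring close to the cut.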






