Precision Learning for the USMLE: A Randomized Controlled Trial of Traditional vs. AI-Powered Learning
Background: USMLE licensing examinations function as consequential gatekeepers for residency selection and specialty access, with implications for both individual students and the medical schools whose pass rates and match outcomes affect institutional standing. Nearly all US medical students prepare using the same self-directed question bank, without algorithmic adaptation to their specific gaps, timelines, or goals. Step 1 failure rates have risen 161% since 2022 and mean Step 2 CK scores have increased 10 points over the past decade, intensifying competitive pressure at a time when preparation methods have remained largely unchanged. No randomized trial has yet evaluated whether adaptive, personalized question bank preparation improves outcomes relative to traditional self-directed use.
Methods: A total of 155 US medical students were randomized to an AI-powered adaptive QBank (Ora AI; n = 77) or to continued self-directed study with UWorld (n = 78) for 14 days. The primary outcome was posttest score on a 60-item standardized NBME assessment, analyzed using ANCOVA adjusted for baseline score. Secondary outcomes included satisfaction, perceived metacognitive burden, and exploratory equity analyses by URiM status.
Results: Ora AI outperformed UWorld by 1.40 questions on the 60-item posttest (95% CI: 0.21–2.59; p = .021; d = 0.19), with consistent effects across baseline performance levels and exam cohorts. Ora AI participants reported significantly less time planning their studies (rank-biserial r = +0.39; p < .001), greater perceived improvement in weak topic areas (r = +0.35; p < .001), and higher platform enjoyment (r = +0.59; p < .001). Among underrepresented students, the estimated advantage trended larger (+1.78 questions) than among non-underrepresented students (+0.91 questions), a directional pattern warranting investigation in a larger sample.
Conclusions: In the first randomized trial of adaptive versus traditional QBank preparation for USMLE examinations, an AI-powered adaptive platform produced a statistically significant improvement in practice examination performance over 14 days, alongside reductions in perceived study planning burden. These findings provide initial evidence for precision education as a viable approach to board preparation and identify important directions for future research.
Keywords: USMLE; adaptive learning; precision education; randomized controlled trial; medical education; board examination preparation
1. INTRODUCTION
The United States Medical Licensing Examination (USMLE) sequence serves as medicine's primary safeguard for physician competency—and a consequential gatekeeper for both individual students and the medical schools that train them [1]. For students, performance on these examinations determines residency match success, specialty access, and career trajectory: approximately one in eight US medical students fails to match into their preferred specialty [2,3], and among those who do, nearly half fail to match their first-ranked program [4]. In the most competitive specialties, match failure rates approach one in three [2]. Critically, these figures represent a lower bound: students who recognize their exam performance is insufficient routinely self-select out of aspirational specialties and programs before they apply, rendering the true cost of inadequate preparation systematically invisible in published match statistics. For institutions, the stakes are equally concrete—aggregate board pass rates, score distributions, and match outcomes directly influence LCME accreditation standing, institutional reputation, and program competitiveness, with schools whose Step 1 pass rates or residency match rates fall below national benchmarks facing formal accreditation consequences [5]. As the volume of testable medical knowledge expands at an exponential rate [6] while students' study time remains fixed, this shared challenge grows ever more acute—yet the dominant model of board preparation has remained fundamentally one-size-fits-all: a uniform library of practice questions delivered without regard for when a student's examination is scheduled, what specialty they are pursuing, what score they are targeting, or where their knowledge gaps lie. Just as precision medicine has transformed clinical care by tailoring treatment to the individual patient's biology, goals, and circumstances [7], precision education offers a complementary paradigm—one in which the learning experience adapts continuously and dynamically to each student's unique profile.
Medical students widely rely on digital question banks (QBanks) to navigate this challenge. UWorld is the predominant industry standard, with usage rates of 92–98% reported across surveys of US medical students preparing for USMLE Step 1 and Step 2 CK [8,9].
While traditional platforms like UWorld provide a high-quality, well-validated standardized curriculum, their foundational architecture—a static, self-directed repository of questions—presents inherent limitations for learning efficiency. Students typically engage with QBank content through self-directed study blocks, selecting topics based on their own perceived weaknesses and areas of interest. This approach relies heavily on metacognitive accuracy: students must correctly identify their own knowledge gaps and deliberately seek out content that challenges them. However, cognitive psychology and medical education literature consistently demonstrate that learners are poor at self-assessment [10], and consequently students frequently default to studying content they are already comfortable with, risking inefficient over-testing of mastered material while leaving genuine blind spots unaddressed [11]. A second structural limitation is the fixed question pool: every user encounters the same finite set of content, which limits the depth of personalization available to any individual student. Furthermore, because traditional QBanks are not designed to evolve dynamically with a learner over time—tracking performance, adjusting difficulty, and targeting emerging gaps—they tend to be deployed in compressed, dedicated study periods rather than used longitudinally throughout training [12]. This usage pattern promotes massed practice over the sustained spaced repetition that evidence consistently shows is superior for long-term knowledge retention [13].
Emerging artificial intelligence (AI) and adaptive learning platforms, such as Ora AI, aim to address these structural and logistical limitations through scalable precision education [14]. Ora AI implements this paradigm through a personalized spaced repetition algorithm: the platform continuously tracks student performance at the learning objective level, resurfaces previously missed concepts using non-identical question variants to reinforce genuine understanding rather than answer memorization, and weights topic priority as a function of both clinical yield and individual performance gaps — constructing a custom curriculum that no two students experience identically. This approach may improve study efficiency relative to undirected self-study by eliminating the metacognitive burden of deciding what to study and ensuring that limited study time is directed toward the highest-impact gaps. Additionally, the highly scalable nature of AI architecture enables longitudinal deployment across the full arc of preclinical training—rather than the compressed, pre-exam window in which traditional QBanks are most commonly used—potentially allowing adaptive learning to reinforce knowledge through sustained spaced repetition from early in medical school.
The limitations of self-directed study also extend beyond academic efficiency to student mental health. Medical students experience high rates of burnout and imposter syndrome; a 2022 meta-analysis reported a pooled burnout prevalence of 37.2% across medical students globally [15], while imposter syndrome has been estimated to affect approximately 62% of health professional students and trainees [16]. The uncertainty inherent in self-directed board preparation—whether one is targeting the right content, in the right depth, at the right time—may act as an additional, independent stressor that amplifies these existing psychological burdens. Conversely, AI-driven platforms that transparently track performance and algorithmically prescribe the next best study steps may provide a partial psychological safety net by eliminating the metacognitive guesswork of self-directed study. We therefore hypothesized that offloading the curriculum-planning burden to an adaptive algorithm would reduce study-related anxiety alongside any direct effects on academic performance.
These challenges are not distributed equally across the medical student population. While adaptive learning benefits are broadly hypothesized to improve outcomes, they may be especially pronounced among structurally disadvantaged students. Despite the near-universal adoption of standard QBanks, significant testing disparities persist; students from racially and ethnically underrepresented groups and first-generation college students consistently face higher rates of Step 1 failure and lower Step 2 CK scores than their peers, disparities that are not explained by differences in cognitive ability [17]. These gaps are frequently driven by overlapping structural barriers including stereotype threat [18], imposter syndrome [19], and a lack of generational knowledge regarding optimal board preparation strategies—an advantage disproportionately available to students from physician families or well-resourced academic environments [20]. By eliminating the need for accurate self-diagnosis and algorithmically directing every study decision, AI-adaptive platforms may specifically attenuate the metacognitive burden that amplifies these barriers, potentially acting as a structural equalizer for students who face compounding disadvantages.
The urgency of this challenge has intensified markedly in recent years. Since USMLE Step 1 transitioned to pass/fail scoring in January 2022, first-time failure rates among US medical students have nearly tripled—from approximately 909 failures per year across the preceding four years to approximately 2,370 per year since the transition, a 161% increase despite only modest growth in the test-taking population [21]. Simultaneously, mean Step 2 CK scores have risen by 10 points over the past decade (240 in 2014–15 to 250 in 2024–25), with the rate of increase accelerating markedly after Step 1 became pass/fail—and the minimum passing standard itself has been raised by 9 points (from 209 to 218) over the same period [22-24]. In this environment, the Step 2 CK numeric score has emerged as the primary objective differentiator for residency selection, with 90% of program directors reporting that USMLE results influence their interview selection decisions [25]. Students are navigating a landscape that is simultaneously harder on both ends—more students failing the pass/fail examination and the rest competing against a rising ceiling on the scored examination—at exactly the moment that traditional preparation methods remain unchanged.
While the principles of algorithmic spaced repetition are well-validated in medical education for the retention of discrete facts (e.g., flashcard-based platforms) [11], the application of these adaptive algorithms to complex, higher-order clinical vignettes remains largely unexplored in formal, controlled settings. Given that QBanks represent the primary preparation tool for virtually all medical students approaching these consequential examinations, the evidence base for adaptive approaches in this high-stakes, high-volume context is surprisingly sparse.
To address this gap in the literature, we conducted a randomized controlled trial comparing the short-term efficacy of an AI-powered adaptive platform (Ora AI) directly against the industry-standard traditional QBank (UWorld). We hypothesized that (1) Ora AI would yield measurably superior improvements in practice test performance over a two-week study period; (2) Ora AI would reduce perceived metacognitive burden and test-related anxiety; and (3) these benefits would be especially pronounced among structurally disadvantaged students, for whom algorithmic study direction may specifically relieve the compounding burdens of stereotype threat, imposter syndrome, and limited navigational capital.
2. METHODS
2.1 Study Design
We conducted a two-arm, parallel-group, randomized controlled trial comparing an AI-powered adaptive QBank (Ora AI) with the industry-standard traditional QBank (UWorld) among medical students preparing for USMLE Step 1 or Step 2 CK. The primary outcome was posttest score (raw score, 0–60 questions correct) on a standardized 60-question multiple-choice assessment, analyzed using analysis of covariance (ANCOVA) with pretest score, exam cohort, assessment order, and total questions completed as covariates. This study was conducted during the summer of 2025 to minimize interference with formal curricular responsibilities. The trial was deemed exempt from full IRB review under 45 CFR 46.104(d)(1) and (2), as it involved research conducted in an educational setting using standard educational tools and survey procedures, with no collection or storage of identifiable private information. All participants provided informed electronic consent prior to enrollment.
2.2 Participants
Eligible participants were currently enrolled US medical students who were actively preparing for USMLE Step 1 or Step 2 CK, currently studying with UWorld, and able to commit to the full study duration. Participants were required to have never previously completed the NBME Free 120, as this assessment served as the source of all study outcome measures. Participants were excluded if they were unable to complete baseline or outcome assessments, or if they did not meet prespecified study engagement thresholds (see Section 2.6). Participants were recruited via publicly accessible medical student forums and institutional listservs composed of students who had opted in to receive research or educational communications. All participants were assigned anonymized usernames upon enrollment to maintain confidentiality; no names, email addresses, or device identifiers were linked to study data.
2.3 Sample Size
The study was powered to detect a clinically meaningful difference in posttest performance between groups. A sample size of 198 participants (99 per group) was estimated to provide 80% power to detect a difference of 2 questions on a 60-item outcome assessment (corresponding to Cohen's d ≈ 0.4), assuming a standard deviation of 5.25 and accounting for variance reduction from ANCOVA adjustment for baseline scores. Actual enrollment fell short of this target: 155 students were enrolled. Of these, 138 completed all study components and were included in the primary per-protocol analysis (Ora AI: n = 72; UWorld: n = 66).
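The underlying calculation can be sketched in R (illustrative only; the pwr package is an assumption and is not part of the analysis stack named in Section 2.8):

```r
# Illustrative sample-size sketch; assumes the pwr package, which is not
# part of the analysis stack named in Section 2.8.
library(pwr)

d_target <- 2 / 5.25  # 2-question difference / assumed SD -> Cohen's d ~ 0.38

# Per-group n for a two-sided, two-sample comparison at 80% power
pwr.t.test(d = 0.4, power = 0.80, sig.level = 0.05,
           type = "two.sample", alternative = "two.sided")
# n ~ 99 per group (198 total); ANCOVA adjustment for baseline score adds
# variance reduction beyond this unadjusted estimate.
```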
2.4 Randomization and Blinding
Following completion of the pretest, eligible participants were stratified by exam level (USMLE Step 1 or Step 2 CK) and randomly assigned in a 1:1 ratio to either the Ora AI group or the UWorld group using a computer-generated allocation sequence. Allocation was concealed from study personnel at the point of enrollment. Participants registered using anonymized usernames prior to group assignment. Given the nature of the intervention, blinding of participants to study arm was not possible; however, all outcome assessments were administered and scored without knowledge of group assignment.
2.5 Assessments
Each participant completed two assessments: a pretest administered on Day 1 prior to the intervention period, and a posttest administered on Day 16 following the two-week study period. Each assessment consisted of 60 multiple-choice questions drawn from the publicly available NBME Free 120, a set of 120 sample licensing examination questions published by the National Board of Medical Examiners (NBME) for each exam level and explicitly made available for educational and research purposes under the NBME Terms of Use. Questions were representative of those encountered on the USMLE Step 1 and Step 2 CK examinations. Participants self-selected into the Step 1 or Step 2 CK cohort based on their examination preparation needs and received content appropriate to their cohort.
To control for variation in question difficulty and to eliminate test–retest bias, we employed a counterbalanced block design. For each exam level, the 120 NBME Free 120 questions were divided into two distinct 60-question blocks (Block A and Block B) based on the original order of questions in the official NBME release. Participants within each cohort were further randomized to one of two assessment orders: Block A as pretest and Block B as posttest (Order 1), or Block B as pretest and Block A as posttest (Order 2). This structure ensured that all participants encountered the same total number of questions without repeating any items across the pre- and posttest, while statistically balancing any systematic differences in block difficulty through inclusion of assessment order as a covariate in all analyses.
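The allocation and counterbalancing logic can be illustrated with a simplified R sketch (hypothetical code, not the trial's actual allocation script; the block size and seed are assumptions):

```r
# Simplified illustration of stratified 1:1 allocation with counterbalanced
# assessment order. Hypothetical sketch, not the trial's allocation script.
set.seed(2025)  # assumed seed for reproducibility

# Permuted-block sequence (assumed block size 4) within one exam-level stratum
permuted_blocks <- function(n, block_size = 4) {
  n_blocks <- ceiling(n / block_size)
  alloc <- unlist(lapply(seq_len(n_blocks), function(i)
    sample(rep(c("Ora AI", "UWorld"), block_size / 2))))
  alloc[seq_len(n)]
}

# Step 1 stratum (n = 85): arm allocation plus independent 1:1 randomization
# to assessment order (Block A pretest/B posttest vs. the reverse)
step1 <- data.frame(
  cohort = "Step 1",
  arm    = permuted_blocks(85),
  order  = sample(rep(c("A pretest / B posttest",
                        "B pretest / A posttest"), length.out = 85))
)
head(step1)
```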
2.6 Intervention
Following the pretest, participants in the Ora AI group were granted access to the Ora AI platform for the duration of the 14-day study period. Ora AI implements a personalized spaced repetition curriculum that operates at the level of individual learning objectives. Each learning objective is represented by multiple non-identical question variants, allowing repeated exposure to the same concept from different clinical perspectives without permitting simple answer memorization. The platform tracks each student's performance at the learning objective level — integrated with subject- and system-level performance data — and uses this information to determine both when to introduce new learning objectives and when to resurface previously encountered objectives for spaced repetition review. Topic priority is weighted as a function of the clinical yield of each learning objective relative to USMLE examination content specifications and the student's demonstrated performance gaps, with the curriculum adjusting continuously as new data accumulates across each study session. Participants in the control group were instructed to study using UWorld and to do so as they normally would, without restriction on mode or topic selection.
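Ora AI's algorithm is proprietary and not available to us in code form; the following R sketch is purely illustrative of the priority-weighting logic described above, with all variable names, weights, and functional forms hypothetical:

```r
# Purely illustrative sketch of learning-objective priority weighting as
# described in prose; NOT Ora AI's actual algorithm. All names, weights,
# and functional forms are hypothetical.
score_objectives <- function(objectives) {
  days_since <- as.numeric(Sys.Date() - objectives$last_seen)
  # Spaced-repetition urgency grows as an objective goes unreviewed
  urgency  <- 1 - exp(-days_since / objectives$interval)
  # Performance gap: low demonstrated mastery raises priority
  gap      <- 1 - objectives$mastery           # mastery tracked in [0, 1]
  # Weight by clinical yield relative to exam content specifications
  priority <- objectives$yield * gap * urgency
  objectives[order(-priority), ]
}

objectives <- data.frame(
  id        = c("LO-041", "LO-112", "LO-208"),
  yield     = c(0.9, 0.6, 0.8),               # exam-spec weight (hypothetical)
  mastery   = c(0.3, 0.8, 0.5),               # tracked performance
  interval  = c(3, 7, 5),                     # review interval, days
  last_seen = Sys.Date() - c(4, 2, 6)
)
score_objectives(objectives)
```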
To be included in the primary per-protocol analysis, participants in both groups were required to study on at least 10 of the 14 intervention days and to average at least 40 questions per study day (minimum 400 questions total across study days). Adherence for Ora AI participants was verified via automated, anonymized backend platform logs. Adherence for control group participants was assessed via self-report through a structured post-period survey administered at the end of the study period.
2.7 Follow-Up Survey
Following completion of the posttest, all participants completed a structured survey comprising four components.
First, participants self-reported study engagement during the intervention period, including the number of days studied, total questions completed, average daily study hours, and any supplementary resources used. For UWorld participants, these responses served as the primary source of adherence data.
Second, participants rated eleven statements on a 5-point Likert scale (1 = Strongly Disagree, 5 = Strongly Agree) reflecting platform-specific outcomes across three thematic domains: (1) satisfaction and intent to continue ("I enjoyed studying with my assigned QBank"; "When preparing for my real exams, I would continue using my assigned QBank"; "After studying with my assigned QBank, I am less likely to postpone my official exam"); (2) perceived learning and exam alignment ("My weak topic areas improved during the study period"; "My strong topic areas improved during the study period"; "My weak topic areas improved more than my strong topic areas"; "The content areas covered in my assigned QBank closely matched those on official NBME/USMLE exams"; "The style and wording of questions in my assigned QBank felt similar to official NBME/USMLE questions"; "The difficulty of questions in my assigned QBank felt similar to official NBME/USMLE questions"); and (3) psychological and metacognitive outcomes ("Compared with before the study, I now feel more confident (less anxious) about upcoming exams"; "During the study period, I spent less time planning my studies than I typically do").
Third, participants completed a nine-item demographic and equity inventory assessing: URM or disadvantaged background status (self-identification); disability, chronic health condition, or learning difference; history of formal academic accommodations; mental health impact on studying in the past year; first-generation college student status; immigrant or first-generation American status; English as a second language (ESL) status; gender (Woman, Man, Non-binary/gender diverse, Prefer not to say, Other); and race/ethnicity (select all that apply).
Finally, participants provided an open-ended qualitative response describing what they would tell another medical student about their assigned QBank. Qualitative responses will be analyzed thematically [26].
2.8 Statistical Analysis
The primary outcome was posttest score (raw score, 0–60 questions correct), analyzed using analysis of covariance (ANCOVA) with ordinary least squares. Prior to fitting the primary model, we tested the homogeneity of regression slopes assumption by examining the arm × pretest interaction; if significant (p < .05), an interaction model was used and group differences were reported at the 10th, 25th, 50th, 75th, and 90th pretest score percentiles. The primary ANCOVA model included group assignment (Ora AI vs. UWorld) as the primary predictor; pretest score as the primary covariate; and assessment order (Order 1 vs. Order 2), exam cohort (Step 1 vs. Step 2 CK), and total questions completed during the study period as additional covariates. Variance inflation factors (VIFs) were assessed to check for multicollinearity. When model residuals were non-normally distributed (Shapiro-Wilk p < .05), bias-corrected and accelerated (BCa) bootstrapped 95% confidence intervals were computed (B = 10,000). The primary analysis was conducted on the per-protocol population (participants meeting engagement criteria). An intention-to-treat sensitivity analysis including all randomized participants was also conducted. Post-hoc power was computed from the realized sample size and observed effect size.
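A minimal R sketch of the primary model follows, assuming a per-protocol data frame `dat` with hypothetical column names (`posttest`, `arm`, `pretest`, `order`, `cohort`, `questions`):

```r
# Sketch of the primary ANCOVA; `dat` and its column names are hypothetical.
library(car)      # Anova(), vif()
library(emmeans)  # covariate-adjusted group means
library(boot)     # BCa bootstrap

# Homogeneity-of-slopes check: inspect the arm:pretest row
slopes <- lm(posttest ~ arm * pretest + order + cohort + questions, data = dat)
Anova(slopes, type = 2)

# Primary model (interaction dropped when non-significant)
fit <- lm(posttest ~ arm + pretest + order + cohort + questions, data = dat)
Anova(fit, type = 2)
vif(fit)                      # multicollinearity check
emmeans(fit, pairwise ~ arm)  # adjusted means and the arm difference

# BCa bootstrapped 95% CI for the arm effect when residuals are non-normal
# (the coefficient name depends on the factor coding of `arm`)
arm_diff <- function(d, i) {
  coef(lm(posttest ~ arm + pretest + order + cohort + questions,
          data = d[i, ]))["armOra AI"]
}
boot.ci(boot(dat, arm_diff, R = 10000), type = "bca")
```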
Two prespecified primary secondary outcomes were tested with Bonferroni correction (α = 0.025 each): (1) platform adoption intent, assessed by the single item "When preparing for my real exams, I would continue using my assigned QBank" (Wilcoxon rank-sum test); and (2) metacognitive burden relief, assessed by a composite of three items—"I now feel more confident (less anxious)," "I spent less time planning my studies," and "I am less likely to postpone my official exam"—provided Cronbach's alpha across the three items was ≥ 0.70; if alpha < 0.70, the single item "I now feel more confident (less anxious)" was used alone. For all Wilcoxon tests, rank-biserial correlation (r) was reported as the effect size. The remaining nine Likert items were treated as exploratory secondary outcomes, with Benjamini-Hochberg (BH) FDR correction applied within each thematic domain. Medians and interquartile ranges are reported per group.
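These tests can be sketched as follows (Likert column names are hypothetical; the performance package for Cronbach's alpha is an assumption beyond the named stack):

```r
# Sketch of the primary and exploratory secondary outcomes; column names
# in `dat` are hypothetical.
library(effectsize)  # rank_biserial()

# Composite eligibility: Cronbach's alpha across the three burden items
performance::cronbachs_alpha(
  dat[, c("confident", "less_planning", "less_postpone")])

# Primary secondary outcome 1, tested at the Bonferroni-corrected alpha = .025
wilcox.test(would_continue ~ arm, data = dat)
rank_biserial(would_continue ~ arm, data = dat)  # effect size

# Exploratory items: BH (FDR) correction within a thematic domain
items <- c("enjoyed", "weak_improved", "strong_improved")
p_raw <- sapply(items, function(v) wilcox.test(dat[[v]] ~ dat$arm)$p.value)
p.adjust(p_raw, method = "BH")
```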
We examined treatment effect modification by baseline performance using a continuous arm × pretest score interaction term in the primary ANCOVA model, with group differences reported at pretest score percentiles. This approach was prespecified in place of a median split to avoid arbitrary dichotomization of a continuous variable [27]. We additionally examined whether the treatment effect was differentially distributed among structurally disadvantaged students using a single prespecified composite moderator variable, defined per the AAMC FACTS Glossary definition of underrepresented in medicine: TRUE if a participant's self-reported race/ethnicity included Black or African American, Hispanic or Latino, Native American or Alaska Native, or Native Hawaiian or Pacific Islander [28]. Two further exploratory subgroup analyses tested arm × mental health impact interactions; these were BH-corrected within the exploratory tier and are hypothesis-generating only.
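A sketch of both moderation analyses (column names hypothetical; the logical race/ethnicity indicator variables are illustrative placeholders):

```r
# Sketch of the prespecified moderation analyses; column names hypothetical.
library(emmeans)

# Continuous arm x pretest interaction, with arm contrasts estimated at the
# prespecified pretest percentiles
mod <- lm(posttest ~ arm * pretest + order + cohort + questions, data = dat)
pct <- quantile(dat$pretest, c(.10, .25, .50, .75, .90))
emmeans(mod, pairwise ~ arm | pretest, at = list(pretest = unname(pct)))

# Composite URiM moderator per the AAMC FACTS Glossary definition
# (logical race/ethnicity flags below are illustrative placeholders)
dat$urim <- dat$race_black | dat$race_hispanic |
            dat$race_native_am | dat$race_pacific
summary(lm(posttest ~ arm * urim + pretest + order + cohort + questions,
           data = dat))
```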
We additionally examined whether the primary treatment effect was mediated by reduced metacognitive burden using the counterfactual mediation framework (Imai et al. 2010) implemented in the `mediation` R package, with BCa bootstrapped confidence intervals (B = 1,000) for the average causal mediation effect (ACME). Complementary OLS interaction models tested whether the arm × structural disadvantage interaction extended to the psychological secondary outcomes (confidence and planning burden), and moderated mediation analyses examined whether the indirect pathway was stronger among structurally disadvantaged students.
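A sketch of the mediation model using the mediation package named in Section 2.8 (variable names hypothetical):

```r
# Counterfactual mediation sketch (Imai et al. 2010); names hypothetical.
library(mediation)

med_fit <- lm(confident ~ arm + pretest + order + cohort + questions,
              data = dat)
out_fit <- lm(posttest ~ arm + confident + pretest + order + cohort + questions,
              data = dat)
med <- mediate(med_fit, out_fit, treat = "arm", mediator = "confident",
               boot = TRUE, boot.ci.type = "bca", sims = 1000)
summary(med)  # reports the ACME with BCa bootstrapped CIs
```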
Three sensitivity analyses were conducted: (1) a two-dimensional per-protocol threshold sensitivity analysis varying the study-days threshold (8–12 days) and questions-per-day threshold (25–50 questions/day) simultaneously; (2) a primary analysis excluding the questions-completed covariate to estimate the total effect net of differential engagement; and (3) a primary analysis excluding participants with suspiciously low but above-chance assessment scores (13–20/60, above the expected value of random guessing but inconsistent with genuine medical student engagement) or implausible gain scores (|gain| > 2 SD from the mean gain). Note: participants scoring at or below the expected value of random guessing on a 5-option MCQ (≤ 12/60, the expected value of Binomial(60, 0.20)) were hard-excluded from all analyses, including ITT, as scores in this range do not constitute valid assessment data.
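The threshold grid in sensitivity analysis (1) can be sketched as a simple refitting loop (column names hypothetical):

```r
# Sketch of the two-dimensional per-protocol threshold grid; `dat` and its
# column names are hypothetical.
grid <- expand.grid(min_days = 8:12, min_qpd = seq(25, 50, by = 5))

grid$p_value <- apply(grid, 1, function(g) {
  pp <- subset(dat, study_days >= g["min_days"] &
                    questions / study_days >= g["min_qpd"])
  fit <- lm(posttest ~ arm + pretest + order + cohort + questions, data = pp)
  # the coefficient name depends on the factor coding of `arm`
  summary(fit)$coefficients["armOra AI", "Pr(>|t|)"]
})
grid
```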
All statistical analyses were performed using R (version 4.5.3; R Foundation for Statistical Computing, Vienna, Austria) with packages emmeans, car, effectsize, mediation, and tableone. All tests were two-sided. Statistical significance was defined as p < .05, except where Bonferroni correction was applied to the two primary secondary outcomes (p < .025 each).
3. RESULTS
3.1 Participant Flow and Baseline Characteristics
Table 1. Baseline characteristics and study engagement by arm.

| Characteristic | Ora AI (n = 77) | UWorld (n = 78) | p |
|---|---|---|---|
| Pretest score, mean (SD) | 34.0 (7.9) | 34.2 (8.5) | .877 |
| USMLE cohort, n (%) | | | .800 |
| Step 1 | 43 (55.8%) | 42 (53.8%) | |
| Step 2 CK | 34 (44.2%) | 36 (46.2%) | |
| Assessment order: Block A→B, n (%) | 39 (50.6%) | 39 (50.0%) | 1.000 |
| Study days completed, mean (SD) | 11.7 (1.6) | 10.5 (1.7) | <.001 |
| Questions completed, mean (SD) | 445 (184) | 489 (187) | .143 |

Data shown for all enrolled participants (n = 155). Per-protocol sample: n = 138 (Ora AI n = 72, UWorld n = 66). Categorical comparisons by χ² test; continuous comparisons by independent-samples t-test. SD = standard deviation.
A total of 155 participants were enrolled and randomized (Ora AI: n = 77; UWorld: n = 78). All 155 enrolled participants completed the posttest assessment. Seventeen participants (11.0%) did not meet per-protocol engagement criteria (fewer than 10 study days), leaving a per-protocol sample of 138 participants (Ora AI: n = 72; UWorld: n = 66). Per-protocol exclusion was higher in the UWorld arm (12/78, 15.4%) than in the Ora AI arm (5/77, 6.5%). Baseline characteristics were well-matched between arms, including pretest score (Ora AI mean: 34.0/60; UWorld mean: 34.2/60; t = 0.16, p = .877) and cohort distribution (Step 1: n = 85, 54.8%; Step 2 CK: n = 70, 45.2%). Study engagement differed significantly between arms: Ora AI participants studied a mean of 11.7 days (SD = 1.6) compared with 10.5 days (SD = 1.7) for UWorld participants (p < .001), while total questions completed did not significantly differ (Ora AI: 445, SD = 184; UWorld: 489, SD = 187; p = .14).
3.2 Primary Outcome
[Figure 1. Adjusted posttest difference (Ora AI minus UWorld) in the per-protocol and intention-to-treat samples. ANCOVA-adjusted means (posttest ~ arm + pretest + cohort + order + questions completed); positive values favor Ora AI; error bars are 95% confidence intervals. * p = .021; *** p < .001. ITT = intention-to-treat; PP = per-protocol (≥10 study days, pretest above chance).]
In the per-protocol analysis (n = 138), Ora AI participants outperformed UWorld participants on the posttest assessment after adjusting for pretest score, exam cohort, assessment order, and total questions completed (estimated mean difference: +1.40 questions; 95% CI: 0.21–2.59; p = .021; Cohen's d = 0.19 [−0.14, 0.53]). The intention-to-treat analysis (n = 155) yielded a larger and more precise estimate (+1.90 questions; 95% CI: 0.79–3.01; p < .001), likely reflecting the higher rate of non-engagement among UWorld participants who were nonetheless retained in the ITT sample. Post-hoc power for the primary analysis was 64%.
Pretest scores did not significantly moderate the treatment effect (arm × pretest interaction: p = .82), indicating that Ora AI's benefit was consistent across baseline performance levels. Estimated differences at the 10th, 25th, 50th, 75th, and 90th pretest percentiles were +1.21 (p = .25), +1.30 (p = .085), +1.40 (p = .022), +1.50 (p = .044), and +1.58 (p = .10) questions, respectively. Exam cohort did not significantly moderate the treatment effect (arm × cohort interaction: p = .78); estimated effects were +1.55 questions for Step 1 (p = .055) and +1.21 questions for Step 2 CK (p = .18).
3.3 Sensitivity Analyses
The primary result was robust to removal of the questions-completed covariate (+1.51 questions; p = .012), indicating that the observed effect was not an artifact of differential adjustment for study volume. The result was sensitive to exclusion of participants with implausible assessment performance: removing 18 participants flagged for near-chance scores or extreme gain scores attenuated the estimate (+0.92; p = .097), though the direction remained consistent. The two-dimensional per-protocol threshold sensitivity analysis (varying study-day and questions-per-day thresholds simultaneously) revealed that statistical significance was sensitive to the per-protocol definition, particularly at lower day thresholds; results were most stable at or above 10 study days.
3.4 Primary Secondary Outcomes
Cronbach's alpha for the prespecified metacognitive burden composite (confidence, planning burden, and postponement likelihood) was 0.41, below the prespecified threshold of 0.70; accordingly, the composite was not formed and the single item "I now feel more confident (less anxious) about upcoming exams" was used as the second primary secondary outcome.
Platform adoption intent did not differ significantly between arms (Ora AI median: 3.0; UWorld median: 4.0; rank-biserial r = −0.19; p = .095). Confidence/anxiety relief favored Ora AI (Ora AI median: 4.0; UWorld median: 3.0; r = +0.22; p = .037) but fell short of the prespecified Bonferroni-corrected threshold (α = .025).
3.5 Exploratory Secondary Outcomes
[Figure 2. Exploratory Likert outcomes by arm. Filled circles show median response (1–5 scale); items are ordered by rank-biserial effect size (r). * p < .05; *** p < .001 (Wilcoxon rank-sum, BH-corrected within thematic domain). The final item (intent to continue) was assessed at the Bonferroni-corrected threshold and favored UWorld non-significantly.]
Among the exploratory Likert items, Ora AI participants reported significantly higher enjoyment (r = +0.59; p < .001), greater perceived improvement in weak topic areas (r = +0.35; p < .001), improvement in strong topic areas (r = +0.25; p = .011), stronger improvement in weak relative to strong areas (r = +0.27; p = .011), and less time spent planning studies (r = +0.39; p < .001). Confidence relief (r = +0.22; p = .018) did not remain significant after BH correction. No significant differences were observed for exam content alignment, question style match, or likelihood to postpone the official exam. Notably, UWorld participants rated higher intent to continue using their assigned QBank than Ora AI participants, a pattern likely reflecting differential baseline familiarity with the two platforms.
3.6 Mediation Analysis
Ora AI assignment was associated with significantly increased posttest scores (total effect: b = +1.40; p = .021) and with greater confidence (treatment-to-mediator path: b = +0.38; p = .019). However, confidence did not predict posttest score after controlling for arm (mediator-to-outcome path: b = −0.13; p = .69), and the average causal mediation effect was negligible and non-significant (ACME = −0.05; 95% BCa CI: −0.34 to 0.16; p = .71). Confidence therefore does not appear to mediate the primary score improvement.
3.7 Equity Subgroup Analyses
The prespecified equity moderator (URiM status per the AAMC FACTS Glossary: n = 29; 18.7% of sample) was tested as a moderator of the primary treatment effect. Because cell sizes were small (n < 20 per arm), these estimates should be interpreted descriptively rather than inferentially. Among URiM participants, Ora AI outperformed UWorld by an estimated +1.78 questions (95% CI: 0.20 to 3.36; p = .027); among non-URiM participants, the estimated advantage was +0.91 questions (95% CI: −0.89 to 2.70; p = .32). The arm × URiM interaction was not statistically significant (F = 0.53; p = .47), consistent with insufficient power to detect differential subgroup effects. This directional pattern — suggesting potentially larger benefits among URiM students — is hypothesis-generating and warrants investigation in adequately powered future studies.
4. DISCUSSION
4.1 Principal Findings
In this randomized controlled trial comparing an AI-powered adaptive QBank (Ora AI) with the industry-standard traditional QBank (UWorld) among 155 medical students, Ora AI produced a statistically significant improvement in standardized practice examination performance over a 14-day study period (+1.40 questions on a 60-item assessment; p = .021). The effect was consistent across baseline performance levels and exam cohorts, and was confirmed in the intention-to-treat analysis. Ora AI participants also reported significantly less time spent planning their studies, perceived greater improvement in weak topic areas, and rated higher enjoyment of the platform. These findings provide the first randomized evidence that adaptive, personalized question bank preparation may yield measurable performance advantages over self-directed traditional QBank use in the context of USMLE board preparation.
4.2 Interpretation of Effect Size and Clinical Relevance
The observed effect size was modest (Cohen's d = 0.19), corresponding to approximately 2.9 points on the USMLE Step 2 CK scale (d × SD of 15) and movement of approximately 3–4 percentile points within the normative distribution of LCME-accredited medical students [24]. While modest by conventional standards, even small improvements in examination performance carry disproportionate consequences for individual students: approximately one in eight US medical students fails to match into their preferred specialty [2,3], and standardized examination scores remain among the most heavily weighted factors in residency interview selection [25]. In this context, marginal score improvements may have meaningful downstream effects on career trajectories that would not be captured by effect size alone.
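As a point of arithmetic, the score-scale conversion above follows directly from the reported effect size and the approximate Step 2 CK score standard deviation of 15 stated in the text:

$$
\Delta_{\text{Step 2 CK}} \approx d \times SD = 0.19 \times 15 \approx 2.9 \ \text{points}
$$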
4.3 Mechanisms
The secondary outcome data point to two distinct mechanisms through which Ora AI may exert its benefit. First, the significant reduction in perceived planning burden (r = +0.39; p < .001) suggests that Ora AI's algorithmic direction relieved the metacognitive demands of self-directed board preparation, consistent with our hypothesis that adaptive platforms reduce the cognitive load of deciding what to study. Second, the greater perceived improvement in weak topic areas (r = +0.35; p < .001) — and the preferential improvement in weak relative to strong areas (r = +0.27; p = .011) — indicates that Ora AI's adaptive targeting was directionally effective at concentrating study effort where it was most needed.
Formal mediation analysis did not support confidence relief as a pathway through which score improvements operated — the indirect effect via confidence was negligible (ACME = −0.05; p = .71). The score benefit appears to be a direct learning effect rather than one mediated by reduced anxiety. Confidence may be an independent co-benefit of platform use rather than a mechanism of score improvement.
The finding that UWorld participants expressed higher intent to continue using their assigned QBank is noteworthy and likely reflects differential baseline familiarity: participants were already studying with UWorld at enrollment, and the two-week study period was insufficient to build comparable comfort with the Ora AI interface. This result should not be interpreted as evidence of lower satisfaction with Ora AI's learning efficacy, and is consistent with the substantially higher enjoyment ratings among Ora AI users (r = +0.59; p < .001).
4.4 Equity Considerations
A prespecified analysis examined whether treatment effects were differentially distributed among racially and ethnically underrepresented students (defined per the AAMC FACTS Glossary). Although the study was not powered to detect subgroup interactions — and the arm × URiM interaction did not reach statistical significance (p = .47) — the observed pattern was directionally consistent with the equity hypothesis: Ora AI's estimated advantage was larger among URiM students (+1.78 questions; p = .027) than among non-URiM students (+0.91 questions; p = .32). The hypothesis that adaptive platforms may specifically attenuate the metacognitive burden that amplifies stereotype threat and imposter syndrome among underrepresented students [18,19] remains an important and largely untested question in medical education. A prospective study adequately powered for subgroup comparisons — requiring substantially larger enrollment than the current trial — is warranted.
4.5 Limitations
Several limitations warrant consideration. First, the study period was limited to 14 days at moderate intensity, representing a fraction of a typical board preparation period; the long-term effects of adaptive learning platforms on examination performance remain unknown. Second, study engagement was assessed by participant self-report for UWorld participants and by automated platform logs for Ora AI participants — an asymmetry that may introduce differential measurement error. Third, the per-protocol result was sensitive to participant engagement definitions and to exclusion of participants with implausible assessment performance; the primary finding should therefore be interpreted with appropriate caution until replicated in a larger sample. Fourth, participants were not restricted from using supplementary resources during the study period, and differences in such use may have confounded the observed group differences. Fifth, both assessments were drawn from the publicly available NBME Free 120, which — while representative of USMLE content — was not psychometrically equated between blocks; block difficulty differences were statistically balanced through the counterbalanced design and covariate adjustment but cannot be ruled out as a residual source of variability. Sixth, the study was conducted during a discrete summer period and enrolled only students who were already preparing for Step 1 or Step 2 CK, limiting generalizability to other stages of medical training.
4.6 Conclusions
This randomized controlled trial provides evidence that adaptive, personalized QBank preparation produces measurable improvements in USMLE-style examination performance relative to traditional self-directed QBank use over a two-week study period. The effect was modest but consistent across performance levels, supported by secondary outcomes indicating reduced planning burden and enhanced targeting of knowledge gaps. Future research should examine whether these benefits extend across longer study periods, whether they are amplified among underrepresented medical students, and whether the mechanisms identified here — reduced metacognitive burden and adaptive gap targeting — independently account for the observed performance advantage.
References
1. United States Medical Licensing Examination. About the USMLE. USMLE Program; 2025. Available at: https://www.usmle.org/about-usmle
2. National Resident Matching Program. Charting Outcomes in the Match: U.S. MD Seniors, 2024. NRMP; 2024.
3. National Resident Matching Program. Charting Outcomes in the Match: U.S. DO Seniors, 2024. NRMP; 2024.
4. Chandra S, Chandran N, Lam A, et al. Student academic performance factors affecting matching into first-choice residency and competitive specialties. J Osteopath Med. 2019;119(9):606-616.
5. Triola MM, Pusic MV. Exploring LCME's new USMLE norms of accomplishment. Acad Med. 2025;100(5):561-566.
6. Wartman SA, Combs CD. Medical education must move from the information age to the age of artificial intelligence. Acad Med. 2018;93(8):1107-1109.
7. Collins FS, Varmus H. A new initiative on precision medicine. N Engl J Med. 2015;372(9):793-795.
8. Smith J, et al. Exploring students' use of medical education resources for USMLE Step 2 CK exam preparation. Med Educ Online. 2025;30:2484869.
9. Huang GW, et al. Unlocking medical student success: a systematic review of third-party resources for USMLE board preparation. BMC Med Educ. 2025;25:35.
10. Davis DA, Mazmanian PE, Fordis M, et al. Accuracy of physician self-assessment compared with observed measures of competence. JAMA. 2006;296(9):1094-1102.
11. Augustin M. How to learn effectively in medical school: test yourself, learn actively, and repeat in intervals. Yale J Biol Med. 2014;87(2):207-212.
12. Bell SK, Krupat E, Fazio SB, Roberts DH, Schwartzstein RM. Longitudinal pedagogy: a successful response to the fragmentation of the third-year medical student clerkship experience. Acad Med. 2008;83(5):467-475.
13. Sanderson BJ, Sanderson CJ, Lim RL. Systematic review of distributed practice and retrieval practice in health professions education. Adv Health Sci Educ Theory Pract. 2024;29(2):689-714.
14. Fontaine G, Cossette S, Maheu-Cadotte MA, et al. Efficacy of adaptive e-learning for health professionals and students: a systematic review and meta-analysis. BMJ Open. 2019;9(8):e025252.
15. Houpy JC, Lee WW, Woodruff JN, Pincavage AT. Prevalence of burnout in medical students: a systematic review and meta-analysis. Med Sci Educ. 2022;32(4):955-963.
16. Chua S, Sim K, Tan CH. Global prevalence of imposter syndrome in health service providers: a systematic review and meta-analysis. Front Psychiatry. 2025;16:1520453.
17. Jones AC, Nichols AC, McNicholas CM, Stanford FC. Admissions is not enough: the racial achievement gap in medical education. Acad Med. 2021;96(2):176-181.
18. Steele CM, Aronson J. Stereotype threat and the intellectual test performance of African Americans. J Pers Soc Psychol. 1995;69(5):797-811.
19. Chua S, Tan IYK, Thummachai ME, Chew QH, Sim K. Impostor syndrome, associated factors and impact on well-being across medical undergraduates and postgraduate professionals: a scoping review. BMJ Open. 2025;15(7):e097858.
20. Martinez J. Key disparities between first-generation and continuing-generation medical students: a quantitative analysis. BMC Med Educ. 2025;25:955.
21. United States Medical Licensing Examination. Performance Data. USMLE Program; 2025. Available at: https://www.usmle.org/performance-data
22. Louisiana State University Health Sciences Center. National Benchmarks, AY 2014–15 through 2018–19. LSUHSC; 2019.
23. National Board of Medical Examiners. Step 2 CK Annual School Report: Sample 2022. NBME; 2022.
24. United States Medical Licensing Examination. USMLE Score Interpretation Guidelines. USMLE Program; updated 2025.
25. National Resident Matching Program. Results of the 2024 NRMP Program Director Survey. NRMP; 2024.
26. Braun V, Clarke V. Using thematic analysis in psychology. Qual Res Psychol. 2006;3(2):77-101.
27. Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med. 2006;25(1):127-141.
28. Association of American Medical Colleges. FACTS Glossary: Underrepresented in Medicine. AAMC; updated 2025. Available at: https://www.aamc.org/data-reports/students-residents/data/facts-glossary