Cross-School QBank Benchmarks Show How Each School’s Students Compare to 19 Others on Identical Content, System by System.
Ora AI Research Team. Cross-school performance variation on shared content (anonymized).
Curriculum committees are flying partially blind: they see licensing outcomes, surveys, and curriculum documentation, but not topic-level performance against a national peer set answering the same formative questions. Across 20 anonymized US medical schools, 11 of 16 organ systems vary beyond chance on identical content, with a between-school standard deviation of about 5 percentage points per system after sampling noise is removed. This is the first cross-institutional, topic-level, formative-content benchmark at scale in US medical education.
[Figure: summary panels titled "The benchmark", "Identical questions", "After noise removal", and "School-level signal".]
The raw cross-school accuracy range spans ~21–32 pp across systems, but much of that is sampling variation. The per-system noise expectation is 5.2–6.2 pp (dashed line = median, ~5.6 pp). Six representative systems of the 16 are shown.
Permutation null, 5,000 relabelings per system. The between-school component is distinguishable from sampling noise in 11 of 16 systems (p<0.05), strengthening to 13 of 16 at a stricter per-school stability floor.
On identical formative content, the school-level signal is statistically meaningful in 11 of 16 organ systems. After removing the spread expected from finite samples, the between-school standard deviation is about 5 percentage points per system, with the largest noise-adjusted gaps in Renal (6.1 pp), Endocrine (5.6 pp), and Cardiovascular (5.2 pp). This is new information curriculum committees do not currently receive from any standard report.
The missing report
Medical schools already receive data, but the standard views leave a sharp blind spot. NBME's USMLE school reports are private to each school and centered on licensing performance, not precise topic-by-topic formative benchmarking against other schools answering the same questions [1]. The AAMC Graduation Questionnaire tells schools what students report about their education [2]. LCME accreditation documents whether a curriculum meets required standards and covers expected domains [3]. None of these tells a curriculum committee whether its students are unusually strong or weak in Renal, Endocrine, Cardiovascular, or Respiratory performance relative to a national peer set on identical formative content.
That is the gap this analysis fills. It treats Ora's shared question bank as a common measurement surface: same organ-system taxonomy, same formative content, and anonymized school-level performance across 20 verified US/PR medical schools. The result is not a leaderboard and not a school-quality label. It is a curriculum signal. Every school in the sample has both stronger and weaker systems, which is exactly why the view matters. A dean does not need another pass-rate summary to ask better questions. They need to see where their students sit, system by system, against the national picture.
The largest noise-adjusted gaps cluster in core preclinical systems (Renal, Endocrine, and Cardiovascular). They may reflect curriculum sequence, cohort timing, study behavior, user mix, or genuine instructional strengths, and this analysis does not separate those explanations. The point is simpler: without this kind of formative benchmark, the committee cannot see the pattern at all.
How the benchmark was built
- 20 schools, randomly drawn with a pre-registered seed from verified US/PR schools meeting a fixed response floor, not selected as top-N usage sites.
- Anonymized as letters. The brief publishes no school name, per-school count, geography, size, institution type, or adoption timing.
- 16 organ systems measured on shared Ora qbank content using the standard NBME-blueprint depth-2 taxonomy.
- First-attempt, submitted, graded responses; per-school accuracy pooled per system.
- Cross-school dispersion computed as SD and range per system. No per-school value, cell, or heatmap is reported.
- Headline metrics are the noise-adjusted between-school SD and the median-system spread.
- Permutation null with 5,000 relabelings per system estimates the spread expected from finite samples; the reported signal is observed minus that expectation.
- Threshold sensitivity across per-school floors of 50 to 300 responses shows the real signal stabilizing near 5 pp and strengthening as estimates stabilize.
- Descriptive, not causal.
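The permutation step described in the list above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the team's actual pipeline: it assumes response-level data for one organ system (a school index and a 0/1 correctness flag per first-attempt response), treats a "relabeling" as shuffling school labels across responses, takes the noise expectation as the mean SD over relabelings, and subtracts on the SD scale because the text says "observed minus that expectation". The function names `school_sd` and `permutation_benchmark`, the `min_responses` parameter, and the +1 correction in the p-value are all hypothetical choices made for this example.

```python
import numpy as np


def school_sd(correct, school, n_schools, min_responses=50):
    """SD (percentage points) of per-school pooled accuracy for one system.

    Schools with fewer than `min_responses` responses are dropped
    (a hypothetical stand-in for the brief's response floor).
    """
    n = np.bincount(school, minlength=n_schools)                   # responses per school
    k = np.bincount(school, weights=correct, minlength=n_schools)  # correct per school
    keep = n >= min_responses
    if keep.sum() < 2:
        raise ValueError("need at least two schools above the response floor")
    acc = 100.0 * k[keep] / n[keep]
    return acc.std(ddof=1)


def permutation_benchmark(correct, school, n_schools,
                          n_perm=5000, min_responses=50, seed=0):
    """Compare the observed between-school SD with a label-shuffling null."""
    correct = np.asarray(correct, dtype=float)
    school = np.asarray(school, dtype=np.int64)
    rng = np.random.default_rng(seed)

    observed = school_sd(correct, school, n_schools, min_responses)

    # Shuffling labels keeps each school's response count fixed but removes
    # any real school effect, so the null SD reflects sampling noise only.
    null = np.empty(n_perm)
    for i in range(n_perm):
        null[i] = school_sd(correct, rng.permutation(school),
                            n_schools, min_responses)

    noise = null.mean()
    # One-sided p-value with the usual +1 correction (an assumption here).
    p_value = (1 + np.sum(null >= observed)) / (n_perm + 1)
    return {
        "observed_sd": observed,
        "noise_sd": noise,
        "signal": observed - noise,  # assumed SD-scale subtraction
        "p_value": p_value,
    }


if __name__ == "__main__":
    # Synthetic demo: 20 schools, 300 responses each, true accuracy 55-75%.
    rng = np.random.default_rng(1)
    school = np.repeat(np.arange(20), 300)
    true_acc = np.linspace(0.55, 0.75, 20)[school]
    correct = (rng.random(school.size) < true_acc).astype(float)
    print(permutation_benchmark(correct, school, n_schools=20, n_perm=1000))
```

Re-running the function with `min_responses` set to several values between 50 and 300 would give a rough analogue of the threshold-sensitivity check. A variance-scale subtraction (the square root of observed variance minus null variance) is a common alternative to the SD-scale difference used here, and the two can give noticeably different magnitudes.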
This is an observational, descriptive analysis, and the sample is not representative: the schools are drawn from those with substantial Ora engagement, and within each school Ora users self-select rather than representing the full student body. Cross-school differences may partly reflect who opts in, including year in program and study intensity, rather than the school itself. The ~5 pp residual survives a sampling-noise check but is not adjusted for incoming-student characteristics such as MCAT or GPA, which are not in the data. "Identical content" means the shared Ora corpus; per-user content filters may surface different subsets within a system. To guarantee no school is identifiable, the brief reports only aggregate cross-school distributions. No school is named, ranked, counted, or shown as an individual cell, which necessarily limits the resolution a reader can inspect. The variation documented here describes performance measured on Ora in specific systems; it is not a claim about overall medical-education quality. All schools in the sample train competent physicians.
References
1. National Board of Medical Examiners. 2022 USMLE School Report Enhancements and Commonly Asked Questions: Enhanced USMLE School Score Reports (Jan 2023). nbme.org/2022-usmle-school-report-enhancements
2. Association of American Medical Colleges. Medical School Graduation Questionnaire (GQ): All Schools Summary Report. aamc.org/…/graduation-questionnaire-gq
3. Liaison Committee on Medical Education. Functions and Structure of a Medical School: Standards for Accreditation. lcme.org/publications