Cross-School QBank Benchmarks Show How Each School's Students Compare to 19 Others on Identical Content, System by System.

Ora AI Research Team. Cross-school performance variation on shared content (anonymized).

Curriculum committees are flying partially blind: they see licensing outcomes, surveys, and curriculum documentation, but not topic-level performance against a national peer set answering the same formative questions. Across 20 anonymized US/PR medical schools, 11 of 16 organ systems vary beyond chance on identical content, with a between-school standard deviation of about 5 percentage points per system after sampling noise is removed. This is the first cross-institutional, topic-level, formative-content benchmark at scale in US medical education.

Anonymized cross-school analysis · 20 randomly sampled US/PR medical schools (drawn from the larger represented-school set) · 16 NBME organ systems on identical Ora qbank content · drawn from Ora's production database. No school is named, ranked, or shown as an individual cell.

20 peer schools in the benchmark
16 systems measured on identical questions
~5 pp between-school SD after noise removal
11 / 16 systems with real school-level signal

How much do schools differ on identical content?

20 schools · 16 systems · same qbank

Cross-school variation, system by system

Standard deviation of school-level accuracy on identical content. The dashed line marks the spread expected from finite samples alone.

[Figure: raw cross-school SD by system. Systems shown include Renal, Endo, Cardio, Neuro, Resp.]

Raw cross-school accuracy range spans ~21–32 pp across systems, but much of that is sampling variation. Per-system noise expectation 5.2–6.2 pp (dashed line = median ~5.6 pp). Six representative systems shown of 16.

Variation that survives the noise check

Between-school SD after subtracting the sampling-noise expectation: the share of variation attributable to the school, not to small samples.

[Figure: noise-adjusted between-school SD by system. Systems shown include Renal, Endo, Cardio, Neuro, Resp.]

Permutation null, 5,000 relabelings per system. The between-school component is distinguishable from sampling noise in 11 of 16 systems (p<0.05), strengthening to 13 of 16 at a stricter per-school stability floor.

Key finding

On identical formative content, the school-level signal is statistically meaningful in 11 of 16 organ systems. After removing the spread expected from finite samples, the between-school standard deviation is about 5 percentage points per system, with the largest noise-adjusted gaps in Renal (6.1 pp), Endocrine (5.6 pp), and Cardiovascular (5.2 pp). This is new information curriculum committees do not currently receive from any standard report.

The missing report

Medical schools already receive data, but the standard views leave a sharp blind spot. NBME's USMLE school reports are private to each school and centered on licensing performance, not precise topic-by-topic formative benchmarking against other schools answering the same questions.¹ The AAMC Graduation Questionnaire tells schools what students report about their education.² LCME accreditation documents whether a curriculum meets required standards and covers expected domains.³ None of these tells a curriculum committee whether its students are unusually strong or weak in Renal, Endocrine, Cardiovascular, or Respiratory performance relative to a national peer set on identical formative content.

That is the gap this analysis fills. It treats Ora's shared question bank as a common measurement surface: same organ-system taxonomy, same formative content, and anonymized school-level performance across 20 verified US/PR medical schools. The result is not a leaderboard and not a school-quality label. It is a curriculum signal. Every school in the sample has both stronger and weaker systems, which is exactly why the view matters. A dean does not need another pass-rate summary to ask better questions. They need to see where their students sit, system by system, against the national picture.

The largest noise-adjusted gaps cluster in core preclinical systems (Renal, Endocrine, and Cardiovascular). They may reflect curriculum sequence, cohort timing, study behavior, user mix, or genuine instructional strengths, and this descriptive analysis does not distinguish among those explanations. The point is simpler: without this kind of formative benchmark, the committee cannot see the pattern at all.

How the benchmark was built

School set

20 schools, randomly drawn with a pre-registered seed from verified US/PR schools meeting a fixed response floor, not selected as top-N usage sites.
Anonymized as letters. The brief publishes no name, count, geography, size, institution type, or adoption timing.
16 organ systems measured on shared Ora qbank content using the standard NBME-blueprint depth-2 taxonomy.

Performance view

First-attempt, submitted, graded responses; per-school accuracy pooled per system.
Cross-school dispersion computed as SD and range per system. No per-school value, cell, or heatmap is reported.
Headline metrics are the noise-adjusted between-school SD and the median-system spread.
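
As a rough illustration of the performance view above, the following Python sketch pools first-attempt responses into per-school accuracy for each system and then reports the cross-school SD and range. It is not Ora's pipeline: the function name `system_dispersion`, the `(school, system, is_correct)` row layout, the `min_responses` floor parameter, and the use of the sample (n-1) standard deviation are all assumptions made for the example, and the rows are made-up toy data.

```python
from collections import defaultdict
from statistics import stdev


def system_dispersion(responses, min_responses=1):
    """Summarize cross-school spread per system (hypothetical helper).

    responses: iterable of (school, system, is_correct) rows, one per
    first-attempt graded answer. Returns, per system, the SD and range
    of pooled per-school accuracy in percentage points.
    """
    # tallies[system][school] = [number correct, number answered]
    tallies = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for school, system, is_correct in responses:
        cell = tallies[system][school]
        cell[0] += int(is_correct)
        cell[1] += 1

    summary = {}
    for system, schools in tallies.items():
        # Pooled accuracy per school, keeping only schools above the floor.
        acc = [100.0 * hits / n for hits, n in schools.values()
               if n >= min_responses]
        if len(acc) < 2:
            continue  # dispersion is undefined with fewer than two schools
        summary[system] = {
            "sd_pp": stdev(acc),  # sample SD: an assumption, not from the brief
            "range_pp": max(acc) - min(acc),
            "n_schools": len(acc),
        }
    return summary


# Toy data: three anonymized schools, ten Renal responses each.
rows = (
    [("A", "Renal", True)] * 8 + [("A", "Renal", False)] * 2
    + [("B", "Renal", True)] * 6 + [("B", "Renal", False)] * 4
    + [("C", "Renal", True)] * 7 + [("C", "Renal", False)] * 3
)
result = system_dispersion(rows)
print(result)
```

Only the aggregate SD and range leave the function, which mirrors the brief's choice to report no per-school value. Re-running with different `min_responses` values is one way a response floor could be varied, though the brief does not say how its floor was applied.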

Noise check

Permutation null with 5,000 relabelings per system estimates the spread expected from finite samples; the reported signal is observed minus that expectation.
Threshold sensitivity across minimum-response thresholds from 50 to 300 shows the noise-adjusted signal holding near 5 pp and strengthening as estimates stabilize.
Descriptive, not causal.
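
The noise check described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the team's implementation: it assumes the relabeling shuffles school labels across responses within a single system (which preserves each school's sample size), uses the sample SD, and computes an add-one permutation p-value. The names `school_sd` and `noise_check`, the seeds, and the simulated accuracies are hypothetical.

```python
import random
from statistics import mean, stdev


def school_sd(labels, correct):
    """SD (percentage points) of per-school accuracy within one system."""
    totals = {}
    for school, ok in zip(labels, correct):
        hits, n = totals.get(school, (0, 0))
        totals[school] = (hits + int(ok), n + 1)
    return stdev([100.0 * hits / n for hits, n in totals.values()])


def noise_check(labels, correct, n_perm=5000, seed=0):
    """Permutation null for between-school spread in one system."""
    rng = random.Random(seed)
    observed = school_sd(labels, correct)
    shuffled = list(labels)
    null_sds = []
    for _ in range(n_perm):
        # Relabeling breaks any link between school and outcome while
        # keeping every school's number of responses unchanged.
        rng.shuffle(shuffled)
        null_sds.append(school_sd(shuffled, correct))
    noise = mean(null_sds)  # spread expected from finite samples alone
    # Add-one estimate so the p-value is never exactly zero (an assumption).
    p_value = (1 + sum(sd >= observed for sd in null_sds)) / (n_perm + 1)
    return {
        "observed_sd_pp": observed,
        "noise_sd_pp": noise,
        "signal_pp": observed - noise,  # observed minus the null expectation
        "p_value": p_value,
    }


# Simulated system: four schools with genuinely different accuracy.
data_rng = random.Random(42)
true_acc = {"A": 0.55, "B": 0.65, "C": 0.75, "D": 0.85}
labels, correct = [], []
for school, p in true_acc.items():
    for _ in range(200):
        labels.append(school)
        correct.append(data_rng.random() < p)

# Fewer relabelings than the brief's 5,000, only to keep the demo fast.
check = noise_check(labels, correct, n_perm=1000, seed=1)
print(check)
```

Subtracting the null expectation directly from the observed SD follows the brief's wording ("observed minus that expectation"). A variance-based adjustment would be a different technique and would give different numbers.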

Limitations

This is an observational, descriptive analysis, and the sample is not representative: the schools are drawn from those with substantial Ora engagement, and within each school Ora users self-select rather than representing the full student body. Cross-school differences may partly reflect who opts in, including year in program and study intensity, rather than the school itself. The ~5 pp residual survives a sampling-noise check but is not adjusted for incoming-student characteristics such as MCAT or GPA, which are not in the data. "Identical content" means the shared Ora corpus; per-user content filters may surface different subsets within a system. To guarantee no school is identifiable, the brief reports only aggregate cross-school distributions. No school is named, ranked, counted, or shown as an individual cell, which necessarily limits the resolution a reader can inspect. The variation documented here is measured-on-Ora performance on specific systems, not a claim about overall medical-education quality; all schools in the sample train competent physicians.

References

1. National Board of Medical Examiners. 2022 USMLE School Report Enhancements and Commonly Asked Questions: Enhanced USMLE School Score Reports (Jan 2023). nbme.org/2022-usmle-school-report-enhancements
2. Association of American Medical Colleges. Medical School Graduation Questionnaire (GQ): All Schools Summary Report. aamc.org/…/graduation-questionnaire-gq
3. Liaison Committee on Medical Education. Functions and Structure of a Medical School: Standards for Accreditation. lcme.org/publications