Ora's Physician-Reviewed AI Flashcards Have 449 Fewer Medical Errors than the Student-Crowdsourced AnKing Deck.

Ora AI Research Team. Independent random samples, blinded multi-model AI screening, strict top-tier literature adjudication.

A blinded audit of independently sampled flashcards (counting only items directly contradicted by a top-tier journal, current major-society guideline, or authoritative textbook) finds a strict-confirmed error rate of 0.64% for Ora's in-production flashcards vs 1.96% for AnKing: a 3.07× ratio (Δ 1.32 pp; p ≈ 0.000056). The audited AnKing snapshot includes content superseded by sources published up to 19 years before the audit. Ora's regeneration pipeline closes the feedback loop in days; the 15 strict-flagged in-production cards are queued for physician review.

Sample drawn from the production deck snapshot used in the audit. Image-only cards excluded symmetrically. Ora denominator excludes 153 already-suspended cards; AnKing denominator is the full sample. Strict bar = top-tier journals, current major-society guidelines, or authoritative textbooks only.

3.07× · AnKing vs Ora strict-error-rate ratio
0.64% · Ora flashcards (15 of 2,347 in-production)
1.96% · AnKing flashcards (49 of 2,498)
Up to 19 years · guideline drift in audited AnKing cards

Strict literature-confirmed error rate, with guideline-drift evidence

In-production analysis · n = 2,347 Ora, 2,498 AnKing

Strict-literature error rate by arm

Cards strict-confirmed as wrong by a top-tier journal, current major-society guideline, or current authoritative textbook. Ora denominator is in-production (non-suspended) cards only.

Ora (in-production) · n = 2,347 · 15 strict-confirmed
AnKing · n = 2,498 · 49 strict-confirmed

Absolute Δ 1.32 pp (95% CI 0.69–1.95); ratio 3.07×; p ≈ 0.000056. Extrapolated to AnKing's full deck of roughly 34,000 cards, the 1.32 pp gap projects to about 449 more strict-literature-contradicted cards than a deck of the same size would carry at Ora's 0.64% rate (1.96% × 34,000 ≈ 666 vs 0.64% × 34,000 ≈ 217). Three of the 18 originally strict-flagged Ora cards had already been suspended by Ora's QA pipeline before the audit; the remaining 15 are in the physician-review queue.
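The headline arithmetic can be reproduced from the four reported counts. The following is a minimal sketch; the variable names are illustrative, and the 34,000-card deck size is the approximate figure quoted in the text, not an exact count.

```python
# Counts reported in the audit.
ora_errors, ora_n = 15, 2347
anking_errors, anking_n = 49, 2498
deck_size = 34_000  # approximate AnKing deck size quoted in the text

ora_rate = ora_errors / ora_n            # about 0.64%
anking_rate = anking_errors / anking_n   # about 1.96%

ratio = anking_rate / ora_rate                  # AnKing rate relative to Ora rate
gap_pp = (anking_rate - ora_rate) * 100         # absolute gap in percentage points

# Projection: apply the rounded gap (1.32 pp) to the full deck size.
projected_gap = round(round(gap_pp, 2) / 100 * deck_size)

print(f"Ora {ora_rate:.2%}, AnKing {anking_rate:.2%}")
print(f"ratio {ratio:.2f}x, gap {gap_pp:.2f} pp, projected gap {projected_gap} cards")
```

The projection assumes the sampled error rates hold across the whole deck, which is an extrapolation rather than a count.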

Guideline drift in the audited AnKing snapshot

Each row: an audited AnKing card contradicted by a major-society guideline or top-tier journal. Drift = years between the source and the audit date.

Card topic · Contradicting source · Drift
CHD endocarditis prophylaxis · AHA 2007 · 19y
Acanthamoeba vs Naegleria · CDC MMWR 2007 · 19y
TIA < 24h definition · AHA 2009 · 17y
Stable splenic injury → surgery · EAST 2012 · 14y
Cellulitis → routine skin culture · IDSA 2014 · 12y
Aspergillus echinocandin primary · IDSA 2016 · 10y
Burns > 20% → 1000 mL/hr LR · ABLS 2018 · 8y
Platelets < 100k always symptomatic · ASH 2019 · 7y
MDS → AML described as “rare” · ESMO 2021 · 5y

Selected examples. 35 of 49 AnKing strict-confirmed cards (71%) cite a current major-society guideline, regulatory labeling, or scientific statement as the contradicting source. Ora's 15 in-production strict-confirmed cards are first-generation factual issues, not cases of guideline supersession.
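Drift as defined above is a subtraction of the source year from the audit year. A small sketch of that calculation follows. The audit year of 2026 is an assumption inferred from the listed pairs (for example, AHA 2007 with 19y drift); the text does not state it directly.

```python
# Assumption: audit year inferred from "AHA 2007 -> 19y drift"; not stated in the text.
AUDIT_YEAR = 2026

# Source years for the selected AnKing examples listed above.
source_years = {
    "CHD endocarditis prophylaxis": 2007,
    "Acanthamoeba vs Naegleria": 2007,
    "TIA < 24h definition": 2009,
    "Stable splenic injury -> surgery": 2012,
    "Cellulitis -> routine skin culture": 2014,
    "Aspergillus echinocandin primary": 2016,
    "Burns > 20% -> 1000 mL/hr LR": 2018,
    "Platelets < 100k always symptomatic": 2019,
    "MDS -> AML described as rare": 2021,
}

# Drift = years between the contradicting source and the audit date.
drift_years = {card: AUDIT_YEAR - year for card, year in source_years.items()}
max_drift = max(drift_years.values())

# Share of AnKing strict-confirmed cards contradicted by a guideline-type source.
guideline_share = 35 / 49

for card, years in drift_years.items():
    print(f"{card}: {years}y")
print(f"max drift {max_drift}y, guideline share {guideline_share:.0%}")
```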

Key finding

Under the strictest literature bar (top-tier journal, current society guideline, or current authoritative textbook only), AnKing's strict-confirmed error rate is 3.07× that of Ora's in-production flashcards in the audited snapshots. 71% of AnKing's strict-confirmed errors are contradicted by a current major-society guideline, regulatory labeling, or scientific statement, with documented drift of up to 19 years despite continuous community editing. Ora's 15 remaining in-production strict-confirmed cards are queued for physician review on a sub-week turnaround.

Why crowdsourced editing lags. Why an AI pipeline doesn't.

AnKing is the dominant shared deck in US medical education: roughly 34,000 cards (AnKing v12) revised through an AnkiHub workflow where students propose edits, the community reviews, and the maintainer publishes a release. The structural cost is latency: a new guideline must reach a contributor's attention, be translated into a card edit, survive review, and propagate through release cycles. Multi-year drift is the predictable consequence.

Ora's flashcards are drafted by a physician-trained AI from a continuously refreshed literature corpus and regenerated against current standards on a sub-week cadence. New guidelines flow into card content without waiting for a contributor to notice, propose, review, and release. The direction survives every sensitivity check: each grader independently flagged Ora at a lower rate, the ratio holds in both AnKing-mapped and unmapped subsets and across 14 of 18 NBME topics, and the strict-bar funnel tightens the headline at each layer: both rates fall, the gap widens.

Method

Sample

Two independent random samples drawn at the variant level from each corpus, stratified across 18 depth-2 NBME organ-system topics; image-only cards excluded symmetrically.
N. 2,500 Ora + 2,498 AnKing originally drawn. Headline denominator excludes 153 already-suspended Ora cards (not in circulation), yielding 2,347 in-production Ora vs 2,498 AnKing.
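The sampling step described above can be sketched as a per-topic random draw with image-only cards filtered out first. This is an illustration only: the function name, the card dictionaries, the `image_only` field, and the equal-fraction allocation per topic are all assumptions, since the text does not say how the sample was allocated across strata.

```python
import random

def stratified_sample(cards_by_topic, fraction, seed=0):
    """Draw the same fraction of eligible cards from every topic stratum.

    Hypothetical helper: equal-fraction allocation is an assumption,
    not a description of the audit's actual allocation rule.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sample = {}
    for topic, cards in sorted(cards_by_topic.items()):
        # Exclude image-only cards before sampling, as in the text.
        eligible = [c for c in cards if not c.get("image_only", False)]
        k = round(fraction * len(eligible))
        sample[topic] = rng.sample(eligible, k)
    return sample

# Toy corpus with two strata; real strata would be the 18 NBME topics.
corpus = {
    "Cardiovascular": [{"id": f"cv-{i}"} for i in range(40)],
    "Nervous System": [
        {"id": f"ns-{i}", "image_only": i % 11 == 0} for i in range(22)
    ],
}
drawn = stratified_sample(corpus, fraction=0.25, seed=42)

# Headline denominator from the text: drawn Ora cards minus suspended ones.
ora_in_production = 2500 - 153

for topic, cards in drawn.items():
    print(topic, len(cards))
print("Ora in-production denominator:", ora_in_production)
```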

Four-layer pipeline

Layer 1. Each card independently graded by three blinded subagents (Claude Opus 4.8, GPT 5.5, Gemini 3.1 Pro). Any-one-flag advances.
Layer 2. Each flagged card adjudicated against PubMed-indexed primary literature, current society guidelines, and authoritative references. Binary ruling.
Layer 3 (audit). Twelve parallel subagents re-judged every Layer-2 confirmed error against a four-bucket rubric (real / guideline-update / pedantic / false alarm).
Layer 4 (strict). Re-verified every retained real error against a top-tier-only bar: NEJM/JAMA/Lancet-tier journals, current major-society guidelines, or current authoritative textbooks. Default: card stands unless rock-solid contradiction.
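The four layers act as a funnel: a card counts as strict-confirmed only if it passes every stage. The sketch below shows that control flow under stated assumptions. The grader, adjudicator, bucketing, and strict-check callables are hypothetical stand-ins for the model and literature steps, and which Layer-3 buckets advance to Layer 4 is assumed here to be "real" only.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List

@dataclass
class Card:
    card_id: str
    text: str

def strict_confirmed(
    cards: Iterable[Card],
    graders: List[Callable[[Card], bool]],   # Layer 1: True = flags a possible error
    adjudicate: Callable[[Card], bool],      # Layer 2: True = error confirmed
    bucket: Callable[[Card], str],           # Layer 3: one of the four rubric buckets
    strict_check: Callable[[Card], bool],    # Layer 4: True = rock-solid contradiction
    retained_buckets: FrozenSet[str] = frozenset({"real"}),  # assumption
) -> List[Card]:
    confirmed = []
    for card in cards:
        # Layer 1: any single flag advances the card.
        if not any(grader(card) for grader in graders):
            continue
        # Layer 2: binary ruling against the literature.
        if not adjudicate(card):
            continue
        # Layer 3: drop pedantic findings and false alarms.
        if bucket(card) not in retained_buckets:
            continue
        # Layer 4: the card stands unless the strict check contradicts it.
        if strict_check(card):
            confirmed.append(card)
    return confirmed

# Toy run with stub callables in place of real graders.
demo = [Card("c1", "ok"), Card("c2", "pedantic issue"), Card("c3", "real error")]
stub_graders = [
    lambda c: "error" in c.text,
    lambda c: "issue" in c.text,
    lambda c: False,
]
result = strict_confirmed(
    demo,
    stub_graders,
    adjudicate=lambda c: True,
    bucket=lambda c: "real" if "real" in c.text else "pedantic",
    strict_check=lambda c: True,
)
print([c.card_id for c in result])
```

Because each layer can only remove cards, the strict-confirmed count can never exceed the Layer-1 flag count.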

Outcome

Primary. Share of each arm with a Layer-4 strict-confirmed error. Two-proportion z-test; Wald 95% CIs.
Sensitivity. Re-ran headline under each Layer-1 model as sole screener and on the AnKing-mapped-only subset. Direction unchanged in every case.
Falsification. Probed sub-corpus concentration, per-grader asymmetry, topic reversals, deck-version drift, and tertiary-source leakage (zero observed).
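The primary test named above can be checked from the reported counts with the standard library alone. This sketch assumes the pooled-variance form of the two-proportion z-test, which is not specified in the text but yields a p-value consistent with the reported one; the Wald interval uses the unpooled standard error. The function name is illustrative.

```python
import math

def two_proportion_test(x1, n1, x2, n2, z_crit=1.96):
    """Pooled two-proportion z-test plus a Wald CI for p2 - p1."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p2 - p1

    # Pooled standard error under the null hypothesis of equal rates (assumption).
    pooled = (x1 + x2) / (n1 + n2)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = diff / se_pooled

    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))

    # Wald 95% interval uses the unpooled standard error.
    se_wald = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    ci = (diff - z_crit * se_wald, diff + z_crit * se_wald)
    return diff, z, p_value, ci

# Ora: 15 of 2,347 in-production; AnKing: 49 of 2,498.
diff, z, p_value, ci = two_proportion_test(15, 2347, 49, 2498)
print(f"diff {diff * 100:.2f} pp, z {z:.2f}, p {p_value:.6f}")
print(f"95% CI {ci[0] * 100:.2f} to {ci[1] * 100:.2f} pp")
```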

Limitations

This is a point-in-time snapshot against current published standards, not a permanent claim about either deck. Both decks evolve: Ora regenerates against a refreshed corpus; AnKing reflects the community release current at audit time. Neither deck is error-free, and the absolute Ora rate (0.64%) is not zero. The Ora denominator excludes already-suspended cards; AnKing has no platform-level suspension flag. Three of 18 NBME topics reverse the headline (Nervous System the only reversal on a non-trivial denominator), reported transparently rather than excluded. Layer-4 strict verdicts are conservative (default-defended) and applied symmetrically across arms.

References

Lu M, Farhat JH, Beck Dallaghan GL. Enhanced Learning and Retention of Medical Knowledge Using the Mobile Flash Card Application Anki. Med Sci Educ. 2021;31(6):1975–1981. doi:10.1007/s40670-021-01435-w
American Heart Association. Diagnosis, Workup, Risk Reduction of TIA in the Emergency Department: AHA Scientific Statement. Stroke. 2023.
Wilson W, et al. Prevention of Infective Endocarditis: AHA Guidelines. Circulation. 2007;116(15):1736–1754.
CDC. Acanthamoeba Keratitis — Multiple States, 2005–2007. MMWR. 2007;56(21):532–534.
National Board of Medical Examiners. Constructing Written Test Questions (NBME Item-Writing Guidelines). nbme.org IWG Gold Book
AnkiHub Community. AnkiHub's Role in the AnKing Step Deck (community change-note process). community.ankihub.net