Ora's AI Tutor Answers Medical Student Questions More Accurately than Frontier Models GPT 5.5 (OpenAI), Claude Opus 4.8 (Anthropic), and Gemini 3.1 Pro (Google).
Research · AI Safety & Grounding


Ora AI Research Team. Blinded multi-model evaluation of production chat traffic.

We pulled 200 real, anonymized student questions (each about the specific vignette, flashcard, article, or video the student had open) together with the Ora replies those students actually received, and re-ran the same questions against three ungrounded frontier models (GPT 5.5, Claude Opus 4.8, Gemini 3.1 Pro). A blinded three-model panel then scored all 800 responses on five dimensions. Ora scored 4.5/5 on contextual relevance versus 3.5 for the ungrounded models, and 4.6 versus 4.2 on pedagogical fit; factual accuracy was near-ceiling and indistinguishable across all four systems; none of the 800 responses had a confirmed safety issue. Grounding's benefit lies in addressing the specific item the student has open, not in raw correctness.

Sampled from 8,631 content-grounded conversations (73% of grounded chat in the C-2 sample), stratified 50 each across vignette, flashcard, article, and video. Ora responses are the verbatim production replies students received. N = 200 queries × 4 systems × 3 blinded model graders × 5 dimensions = 12,000 ratings.
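The rating count follows directly from the design grid. A minimal sketch of that arithmetic: the system and dimension names come from this write-up, while the grader labels are hypothetical stand-ins, since the panel's models are not named here.

```python
from itertools import product

N_QUERIES = 200
SYSTEMS = ("Ora", "GPT 5.5", "Claude Opus 4.8", "Gemini 3.1 Pro")
GRADERS = ("grader_a", "grader_b", "grader_c")  # hypothetical labels
DIMENSIONS = ("factual", "citation", "safety", "pedagogy", "relevance")

# One response per (query, system) pair.
responses = list(product(range(N_QUERIES), SYSTEMS))

# One rating per (response, grader, dimension) triple.
ratings = list(product(responses, GRADERS, DIMENSIONS))

print(len(responses))  # 800
print(len(ratings))    # 12000
```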
  • Contextual relevance: 4.5 vs 3.5 (Ora vs ungrounded LLMs, out of 5)
  • Pedagogical fit: 4.6 vs 4.2 (Ora vs ungrounded LLMs, out of 5)
  • Confirmed safety issues: 0 across all 800 responses
  • Blinded ratings: 12,000 (200 × 4 × 3 × 5)

Where grounding moves the needle, and where it doesn't
Blinded 3-model panel · consensus score per dimension

[Figure: Rubric score by dimension (factual, citation, safety, pedagogy, relevance). Ora (grounded) vs the average of three ungrounded frontier LLMs, on a 1 to 5 scale. The gap concentrates in contextual relevance and pedagogical fit.]

[Figure: Contextual relevance by content type (flashcard, vignette, article, video), Ora (grounded) vs ungrounded LLMs. The grounding advantage holds across every content type, and is largest for vignettes and videos, items a model cannot see.]

What grounding changes

The procurement question for medical-education AI is "is this safe in our students' hands?" The consumer question is "why pay for Ora's AI when ChatGPT is free?" Both resolve at one point: does grounding a response in a curated, physician-built corpus [1] beat an unaugmented frontier model on the same student query? The honest answer is nuanced.

Frontier models are already strong on direct medical knowledge. Our three ungrounded arms scored a mean 4.8/5 on factual accuracy, indistinguishable from Ora and consistent with the published ceiling for these models [2, 4]. No system produced a single confirmed safety issue across 800 responses, after a blinded adjudicator applied an expert-physician rubric plus guideline search to every flagged case. For the institutional buyer, the reassuring headline: on this corpus, grounded and ungrounded answers alike were clinically safe.

Grounding's measurable edge is elsewhere. Asked about a specific item, the ungrounded models gave correct but generic answers that talked past it; Ora's reply engaged the actual vignette stem, flashcard, article, or video. All three graders independently scored Ora about a full point higher on contextual relevance (4.5 vs 3.5) and higher on pedagogical fit (4.6 vs 4.2), with strong agreement (quadratic-weighted kappa 0.80). The gap held across all four content types and was largest for vignettes and videos, the case or clip an ungrounded model cannot see.
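For readers unfamiliar with the agreement statistic: quadratic-weighted kappa penalizes each disagreement by the squared distance between two graders' scores, so a 5-vs-1 split costs far more than a 5-vs-4 split. The sketch below computes it for one pair of graders on a 1 to 5 scale. How the reported 0.80 was aggregated across the three graders is not stated here, so averaging pairwise values would be only one plausible reading; the function name and the sample scores are illustrative, not taken from the evaluation.

```python
def quadratic_weighted_kappa(a, b, min_score=1, max_score=5):
    """Quadratic-weighted Cohen's kappa for two raters on an integer scale."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty score lists")
    k = max_score - min_score + 1
    n = len(a)

    # Observed confusion matrix of (rater a score, rater b score) counts.
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - min_score][y - min_score] += 1

    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2
            num += w * observed[i][j]
            den += w * row[i] * col[j] / n

    if den == 0:
        # Both raters gave one identical constant score: kappa is undefined.
        return float("nan")
    return 1.0 - num / den


# Made-up scores for six responses from two hypothetical graders.
grader_a = [5, 4, 4, 3, 5, 2]
grader_b = [5, 4, 3, 3, 4, 2]
print(round(quadratic_weighted_kappa(grader_a, grader_b), 2))
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement.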

Bottom line

On real student queries, ungrounded frontier LLMs are factually strong and, here, clinically safe. What they cannot do is see the specific item a student is studying. Grounding raises contextual relevance from 3.5 to 4.5 and pedagogical fit from 4.2 to 4.6, across every content type. Ora's grounded chat earns its keep by answering the student's actual question, not by out-scoring a frontier model on facts it already knows.

Method

Eval set
  • Source. 200 first-turn student questions, sampled deterministically from 8,631 content-grounded conversations, 50 each across the four content types.
  • Anonymization. Names, schools, dates, URLs, and first-person clinical referents stripped; all 200 manually reviewed before any text left Ora.
  • Ora arm. The verbatim production reply each student received, grounded in the linked item.
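The write-up says the 200 questions were "sampled deterministically" but does not say how. One common way to get a reproducible stratified draw is to rank conversations by a salted hash within each content type and take the first 50. The sketch below illustrates that approach under stated assumptions: the salt, the function names, and the `(conversation_id, content_type)` input shape are all hypothetical.

```python
import hashlib
from collections import defaultdict

CONTENT_TYPES = ("vignette", "flashcard", "article", "video")


def stable_key(conversation_id: str, salt: str) -> str:
    """Hash that depends only on the ID and salt, not on input order."""
    return hashlib.sha256(f"{salt}:{conversation_id}".encode("utf-8")).hexdigest()


def stratified_sample(conversations, per_type=50, salt="eval-v1"):
    """conversations: iterable of (conversation_id, content_type) pairs."""
    by_type = defaultdict(list)
    for conv_id, content_type in conversations:
        by_type[content_type].append(conv_id)

    sample = {}
    for content_type in CONTENT_TYPES:
        ranked = sorted(by_type[content_type], key=lambda c: stable_key(c, salt))
        if len(ranked) < per_type:
            raise ValueError(f"not enough {content_type} conversations")
        sample[content_type] = ranked[:per_type]
    return sample


# Synthetic IDs standing in for the 8,631 grounded conversations.
pool = [(f"conv-{i}", CONTENT_TYPES[i % 4]) for i in range(8631)]
picked = stratified_sample(pool)
print({content_type: len(ids) for content_type, ids in picked.items()})
```

Because the ranking depends only on the ID and the salt, re-running the draw on the same pool, in any order, returns the same 200 items.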
Comparators
  • Three frontier models (GPT 5.5, Claude Opus 4.8, Gemini 3.1 Pro) answered each query with no system prompt and no grounded content: the unaugmented use a student gets from a chatbot.
  • Access. Via Cursor-bundled models, not vendor APIs; behavior may differ slightly. Slugs snapshotted at evaluation time.
Scoring
  • Layer 1. Three models, blinded to source, scored all 800 responses on five 1 to 5 dimensions plus issue flags.
  • Layer 2. Flagged factual, safety, and citation cases went to a separate blinded adjudicator (Claude Opus 4.8 with literature search) for a binary ruling; relevance and pedagogical fit are raw consensus scores.
  • Pre-registered rubric, locked before scoring.
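The two scoring layers can be sketched as follows. The aggregation rule is an assumption: the write-up reports a "consensus score" without defining it, so the median of the three graders is used here purely for illustration, as is the rule that a flag from any single grader triggers adjudication. Function and variable names are hypothetical.

```python
from statistics import median

DIMENSIONS = ("factual", "citation", "safety", "pedagogy", "relevance")
ADJUDICATED = {"factual", "safety", "citation"}  # Layer 2 dimensions


def consensus(scores_by_grader):
    """Layer 1: median score per dimension (assumed aggregation rule)."""
    return {d: median(g[d] for g in scores_by_grader) for d in DIMENSIONS}


def needs_adjudication(flags_by_grader):
    """Layer 2 routing: dimensions flagged by any grader that get a binary ruling."""
    flagged = set().union(*flags_by_grader)
    return sorted(flagged & ADJUDICATED)


# Made-up scores and flags for one response from three hypothetical graders.
scores = [
    {"factual": 5, "citation": 2, "safety": 5, "pedagogy": 4, "relevance": 3},
    {"factual": 5, "citation": 1, "safety": 5, "pedagogy": 4, "relevance": 4},
    {"factual": 4, "citation": 2, "safety": 5, "pedagogy": 5, "relevance": 4},
]
flags = [{"citation"}, set(), {"relevance", "safety"}]

print(consensus(scores))
print(needs_adjudication(flags))  # ['citation', 'safety']
```

A relevance flag is not routed to the adjudicator here, matching the description above that relevance and fit are reported as raw consensus scores.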
Limitations

This compares Ora's full production stack against unaugmented API-style calls, not against an alternative grounding method. Citation quality was uniformly weak across all four systems (1.9/5) with near-zero grader agreement, so it is not a differentiator. Safety was assessed by rubric, not student outcomes; "no confirmed issue" is not "no possible harm" [3]. The eval set is Ora users' queries and may not represent all medical-student AI use, and comparator behavior may change as frontier models advance.

References

  1. Zakka C, Shad R, Chaurasia A, et al. Almanac: Retrieval-Augmented Language Models for Clinical Medicine. NEJM AI. 2024;1(2). doi:10.1056/AIoa2300068
  2. Goodman RS, Patrinely JR, Stone CA Jr, et al. Accuracy and Reliability of Chatbot Responses to Physician Questions. JAMA Netw Open. 2023;6(10):e2336483. doi:10.1001/jamanetworkopen.2023.36483
  3. Lee P, Bubeck S, Petro J. Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. N Engl J Med. 2023;388(13):1233-1239. doi:10.1056/NEJMsr2214184
  4. Singhal K, Tu T, Gottweis J, et al. Toward expert-level medical question answering with large language models. Nat Med. 2025. doi:10.1038/s41591-024-03423-7