Ora AI's Active-Recall Videos Lift QBank Accuracy Up to 4.4 Points on 80,000 Same-Topic Responses.

Ora AI Research Team. A first-attempt QBank-accuracy lift associated with completing topically-linked Ora videos.

Across an analytic sample of 80,238 first-attempt QBank responses, having completed the Ora video on a vignette’s topic was associated with higher first-attempt accuracy: 63.7% versus 61.1%, a headline +2.66-point lift. The lift grows with dose, to +4.41 pts after three or more same-topic videos, and with elapsed time, to +4.05 pts at 30+ days, consistent with durable retention. It survives a within-student paired analysis (+1.64 pts, 95% CI 0.02–3.26), ruling out “better students watch more videos” as the sole explanation. Ora’s videos carry 1,420 mid-video active-recall questions across 97% of the catalog, grounded in the interpolated-testing literature.

Drawn from Ora's production database. First-attempt vignette responses only; each response counted once even when its vignette is linked to multiple lectures. “Same-topic” = video and vignette share a parent lecture in the curriculum map.

+2.66 pts: first-attempt QBank accuracy lift on same-topic items

+4.41 pts: dose-response gap (0 vs 3+ same-topic videos)

+4.05 pts: durable lift at 30+ days after the video watch

1,420: embedded recall questions across 355 of 366 videos

Dose-response and retention shape

Analytic sample · n = 80,238 first-attempt responses

Dose-response: accuracy by same-topic videos completed

Per response, the number of distinct videos the student had completed on the vignette’s parent lectures.

0 videos: 57,248 responses

1 video: 10,759 responses

2 videos: 6,038 responses

3+ videos: 6,193 responses

Accuracy rises monotonically across all four dose levels; the gap between 0 and 3+ videos is +4.41 pts. First-attempt responses only.
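The dose grouping described above can be sketched in a few lines of Python. This is a minimal illustration, not Ora's production code: the function names and the toy data are hypothetical, and it assumes each response already carries its count of distinct completed same-topic videos. The reported bucket sizes are copied from the breakdown above and checked against the analytic sample size.

```python
from collections import defaultdict


def dose_bucket(n_videos: int) -> str:
    """Map a count of distinct completed same-topic videos to a dose level."""
    if n_videos < 0:
        raise ValueError("video count cannot be negative")
    return "3+" if n_videos >= 3 else str(n_videos)


def accuracy_by_dose(responses):
    """responses: iterable of (n_videos, is_correct), one per first-attempt response.

    Returns {bucket: (n_responses, accuracy_percent)}.
    """
    totals = defaultdict(lambda: [0, 0])  # bucket -> [n_responses, n_correct]
    for n_videos, is_correct in responses:
        cell = totals[dose_bucket(n_videos)]
        cell[0] += 1
        cell[1] += int(is_correct)
    return {bucket: (n, 100.0 * c / n) for bucket, (n, c) in totals.items()}


# Bucket sizes as reported in the brief; they should sum to the analytic sample.
REPORTED_COUNTS = {"0": 57_248, "1": 10_759, "2": 6_038, "3+": 6_193}
assert sum(REPORTED_COUNTS.values()) == 80_238

# Hypothetical toy responses (not Ora data), used only to exercise the function.
toy = [(0, True), (0, False), (1, True), (2, True), (2, False), (5, True), (3, True)]
table = accuracy_by_dose(toy)
```

Because every response falls into exactly one bucket, the four bucket sizes partition the sample, which is what the sum check confirms for the reported counts.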

Retention shape: accuracy by time since the most recent same-topic video

Lag in days between the most recent completed same-topic video and the response.

No video: 62,897 responses

Same day: 802 responses

1–6 days: 2,271 responses

7–29 days: 5,089 responses

30+ days: 9,179 responses

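The lift grows with elapsed time, peaking at +4.05 pts at 30+ days. The same-day reversal (−4.98 pts vs the no-video baseline) is consistent with selection toward weak topics: students attempt items on topics they have just watched because they know they are weak there.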
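The lag grouping can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name is hypothetical, and "same day" is taken to mean a calendar-date difference of zero, which the brief does not specify (it could equally be a 24-hour window). The reported bucket sizes are copied from the breakdown above.

```python
from datetime import date
from typing import Optional


def lag_bucket(last_video_day: Optional[date], response_day: date) -> str:
    """Bucket a response by days since the most recent completed same-topic video.

    Assumption: lag is a calendar-date difference, so "Same day" means lag == 0.
    """
    if last_video_day is None:
        return "No video"
    lag = (response_day - last_video_day).days
    if lag < 0:
        raise ValueError("video completion must precede the response")
    if lag == 0:
        return "Same day"
    if lag <= 6:
        return "1-6 days"
    if lag <= 29:
        return "7-29 days"
    return "30+ days"


# Bucket sizes as reported in the brief; they should sum to the analytic sample.
REPORTED_LAG_COUNTS = {
    "No video": 62_897,
    "Same day": 802,
    "1-6 days": 2_271,
    "7-29 days": 5_089,
    "30+ days": 9_179,
}
assert sum(REPORTED_LAG_COUNTS.values()) == 80_238
```

For example, `lag_bucket(date(2024, 1, 1), date(2024, 1, 31))` falls in the 30+ days bucket, since the two dates are 30 days apart.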

Within-student paired check: controls for student-ability confounding

The analysis was restricted to the 117 students with at least 10 responses in both arms. Each student’s own accuracy gap (video-first arm minus no-video arm) was computed, and the gaps were averaged. Within-student mean lift = +1.64 pts (95% CI: 0.02–3.26, p ≈ 0.049); 58.1% of these students individually scored higher on same-topic items where they had completed a video first. In other words, the same student tends to do better on topics where they previously watched a same-topic video than on topics where they did not. The effect direction holds; the magnitude shrinks under the stricter design.
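The paired check can be sketched with the standard library alone. This is a minimal sketch, not Ora's pipeline: the function name, the input tuple layout, and the toy data are hypothetical, and the critical value `t_crit=1.98` is an approximation for roughly 116 degrees of freedom supplied by the caller rather than computed.

```python
import math
from collections import defaultdict
from statistics import mean, stdev


def within_student_lift(responses, min_per_arm=10, t_crit=1.98):
    """responses: iterable of (student_id, watched_first, is_correct).

    t_crit is the two-sided 95% critical value for the caller's degrees of
    freedom; 1.98 approximates ~116 df (an assumption, not a computed value).
    Returns (n_students, mean_gap_pts, (ci_low, ci_high), share_positive).
    """
    # student -> arm (True = video first) -> [n_responses, n_correct]
    cells = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for student, watched, correct in responses:
        arm = cells[student][bool(watched)]
        arm[0] += 1
        arm[1] += int(correct)

    gaps = []
    for arms in cells.values():
        (n1, c1), (n0, c0) = arms[True], arms[False]
        if n1 >= min_per_arm and n0 >= min_per_arm:
            gaps.append(100.0 * (c1 / n1 - c0 / n0))  # per-student gap in points
    if len(gaps) < 2:
        raise ValueError("need at least two eligible students")

    m = mean(gaps)
    se = stdev(gaps) / math.sqrt(len(gaps))  # standard error of the mean gap
    share_positive = sum(g > 0 for g in gaps) / len(gaps)
    return len(gaps), m, (m - t_crit * se, m + t_crit * se), share_positive


# Hypothetical toy data (not Ora data); threshold lowered to 2 to keep it small.
toy = [
    ("A", True, True), ("A", True, True), ("A", False, True), ("A", False, False),
    ("B", True, True), ("B", True, False), ("B", False, True), ("B", False, False),
    ("C", True, True),  # too few responses in either arm, so excluded
]
n, lift, ci, share = within_student_lift(toy, min_per_arm=2)
```

As a consistency check on the reported figures: the interval half-width is (3.26 − 0.02) / 2 = 1.62 pts, which with a critical value near 1.98 implies a standard error of about 0.82 pts and a t-statistic near 2.0, consistent with a p-value just under 0.05.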

What the brief says (and does not say)

This brief reports an associational pattern, not a causal effect: students self-select into videos based on their topic-by-topic strengths and Ora’s scheduler. The within-student paired analysis rules out the most obvious confounder (better students watch more videos and score higher in general), but residual within-student confounding remains: a student may watch a video on a given topic because they have more time or attention available that day, and that same conscientiousness may carry over into the subsequent vignette. The same-day reversal (−4.98 pts vs no-video baseline) is consistent with students preferentially attempting QBank items on topics they just watched because they know they’re weak there, and is reported transparently rather than excluded.

A calibration check straddling the interactive-question layer rollout found the lift was +4.5 pts before and +4.5 pts after, essentially unchanged. We therefore cannot attribute additional lift to the interactive-question layer specifically with the current data; the layer is reported here as the operationalized design (grounded in the testing-effect literature), not as the measured source of the lift.

Method, intervention, and substrate

Analysis design

Unit. First-attempt vignette response, deduped: each response counted once even when its vignette links to multiple lectures.
Exposure. At least one completed video watch on a video that shares a parent lecture with the vignette, with completion before the response.
Within-student check. Per-user accuracy difference between arms, restricted to users with ≥10 responses in each; paired-t inference.
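The unit and exposure definitions above can be sketched as follows. This is an illustrative sketch only: the record layout, function names, and toy identifiers are hypothetical, and it assumes "first attempt" means the earliest timestamped response per student and vignette, which the brief does not spell out.

```python
from datetime import datetime


def first_attempts(responses):
    """Keep only the earliest response per (student, vignette) pair."""
    earliest = {}
    for r in responses:
        key = (r["student"], r["vignette"])
        if key not in earliest or r["at"] < earliest[key]["at"]:
            earliest[key] = r
    return list(earliest.values())


def is_exposed(response, vignette_lectures, video_lectures, completions):
    """True if the student completed, before this response, at least one video
    that shares a parent lecture with the response's vignette.

    A single boolean per response means a vignette linked to several lectures
    is still counted once.
    """
    topic = vignette_lectures.get(response["vignette"], set())
    for video, done_at in completions.get(response["student"], []):
        if done_at < response["at"] and video_lectures.get(video, set()) & topic:
            return True
    return False


# Hypothetical toy link graph and activity (not Ora data).
vignette_lectures = {"v1": {"L1", "L2"}, "v2": {"L3"}}
video_lectures = {"vidA": {"L2"}, "vidB": {"L3"}}
completions = {"s1": [("vidA", datetime(2024, 1, 1, 9)), ("vidB", datetime(2024, 1, 5, 9))]}
responses = [
    {"student": "s1", "vignette": "v1", "at": datetime(2024, 1, 2, 9), "correct": True},
    {"student": "s1", "vignette": "v1", "at": datetime(2024, 1, 3, 9), "correct": False},  # repeat attempt
    {"student": "s1", "vignette": "v2", "at": datetime(2024, 1, 4, 9), "correct": True},  # vidB finished later
]
firsts = first_attempts(responses)
flags = {
    r["vignette"]: is_exposed(r, vignette_lectures, video_lectures, completions)
    for r in firsts
}
```

In the toy data, the response on `v1` counts as exposed because `vidA` shares lecture `L2` and was completed earlier, while the response on `v2` does not, because `vidB` was completed only after the response.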

The intervention

Interactive recall questions. Four-option multiple-choice prompts that auto-pause playback at predetermined timestamps.
Coverage. 1,420 questions across 355 of 366 videos (97%); median 4 per video.
Library. 290 Osmosis (CC BY-SA 4.0); 58 Anatomy (Ora-produced, VOKA visuals); 18 Ora original.

Data substrate

Video activity. Voluntary video-watch events from the analytic sample; completion timestamped before the linked vignette response.
Vignette responses. Analytic sample of 80,238 first-attempt responses on vignettes with at least one lecture link.
Link graph. 19,953 video↔lecture edges × 140,054 vignette↔lecture edges over 7,279 lectures spanning both modalities.

Limitations

Observational, not randomized; selection bias is the primary threat to interpretation, and the within-student analysis mitigates but does not eliminate it. The link graph is lecture-based (videos and vignettes joined through their shared parent lecture), which is coarser than a direct video↔vignette link; direct cross-modal linking is on the roadmap, so this lecture-mediated signal is the best currently available rather than the cleanest possible one. The same-day reversal (−4.98 pts vs no-video baseline) is most plausibly explained by student topic-selection behavior rather than a harmful video effect, although these data cannot separate the two. The calibration check straddling the interactive-question layer rollout finds no additional lift attributable to that layer specifically; the layer’s mechanism evidence is the cited testing-effect literature, not these data. Per-question response capture is on the instrumentation roadmap and will enable a sharper analysis in a future iteration.

References

Szpunar KK, Khan NY, Schacter DL. Interpolated memory tests reduce mind wandering and improve learning of online lectures. Proc Natl Acad Sci USA. 2013;110(16):6313–6317. doi:10.1073/pnas.1221764110
Schacter DL, Szpunar KK. Enhancing attention and memory during video-recorded lectures. Scholarship of Teaching and Learning in Psychology. 2015;1(1):60–71. doi:10.1037/stl0000011
Roediger HL III, Karpicke JD. The power of testing memory: basic research and implications for educational practice. Perspectives on Psychological Science. 2006;1(3):181–210. doi:10.1111/j.1745-6916.2006.00012.x
Adesope OO, Trevisan DA, Sundararajan N. Rethinking the use of tests: a meta-analysis of practice testing. Review of Educational Research. 2017;87(3):659–701. doi:10.3102/0034654316689306
Osmosis × Wiki Project Med Foundation. Videos from Osmosis on Wikimedia Commons, released under Creative Commons Attribution-ShareAlike 4.0 International. commons.wikimedia.org/wiki/Category:Videos_from_Osmosis