The Spacing Effect in Theory and Practice: A Narrative Review of Mechanisms, Algorithms, and the Research–Implementation Gap

Authors: Retentio Research (research@retentio.app) · Valor (Valor@yuda.me)

Document type: Narrative literature review
Scope: Cognitive psychology, neuroscience, second-language acquisition, and learning-technology design

Abstract

The spacing effect—the superiority of distributed over massed practice for long-term retention—is among the most replicated findings in experimental psychology, with evidence spanning molecular neuroscience, behavioral meta-analyses, and large-scale field studies. Despite effect sizes favoring spaced practice over cramming by approximately 74% in meta-analytic syntheses (Cepeda et al., 2006) and distributed practice ranking highest among ten learner-controlled study techniques (d = 0.85; Donoghue & Hattie, 2021), formal education and consumer learning applications have largely failed to implement spacing systematically. Recent work strengthens this picture while complicating simple generalizations: a 2026 medical-education meta-analysis of 21,415 learners reported a standardized mean difference of 0.78 favoring spaced repetition (Maye & Hurley, 2026), whereas a 2025 mathematics meta-analysis found a smaller but robust spacing effect (g = 0.28; Murray et al., 2025), and a 2025 classroom-applied review reported d = 0.54 in ecologically valid settings (Mawson & Kang, 2025). This review synthesizes four decades of research on spaced repetition and its computational instantiations, with emphasis on publications from 2024–2026, organized around six themes: (1) biological and neural mechanisms of spacing; (2) behavioral effect sizes, moderators, and optimal interval schedules; (3) the evolution of scheduling algorithms from SuperMemo SM-2 to Free Spaced Repetition Scheduler (FSRS); (4) the recognition–production gap in second-language acquisition; (5) barriers to adoption in institutional and consumer contexts; and (6) tensions between engagement-oriented product design and learning-science evidence. 
Three findings recur across domains: the prediction accuracy of modern schedulers does not necessarily translate into superior learning outcomes; flashcard-based spaced repetition primarily builds receptive knowledge with limited far transfer to productive language use; and metacognitive misalignment, in which learners prefer strategies that feel effective over those that are effective, explains much of the persistent research–practice gap. Directions for future research and design are discussed.

Keywords: spaced repetition, spacing effect, distributed practice, retrieval practice, forgetting curve, spaced repetition system, FSRS, second-language acquisition, learning technology

1. Introduction

When identical learning trials are separated by rest intervals rather than presented in immediate succession, retention improves dramatically—a phenomenon documented as early as Ebbinghaus's (1885/1913) studies of nonsense-syllable memory and subsequently replicated across species, materials, and retention intervals. Contemporary researchers describe the spacing effect as "one of the most robust phenomena in experimental psychology" (Santoro, 2021). Computational spaced-repetition systems (SRS), including Anki, SuperMemo, and commercial language applications, operationalize this principle by scheduling review at expanding intervals calibrated to individual forgetting curves.

Yet a paradox persists. Laboratory and meta-analytic evidence strongly favors spacing, while real-world outcomes remain disappointing: education applications exhibit among the lowest user-retention rates of any mobile app category (2% at Day 30; Business of Apps, 2026), fewer than 0.1% of Duolingo's reported user base completes a course (London Now, 2021), and formal schooling has largely ignored distributed review for decades (Dempster, 1988; Dunlosky et al., 2013b). This review examines why empirically supported spacing practices fail to scale, integrating evidence from neuroscience, cognitive psychology, algorithm engineering, applied linguistics, and learning-product economics.

The present synthesis draws primarily on peer-reviewed meta-analyses, primary experiments, and neuroimaging studies, with supplementary practitioner literature where controlled research is sparse. A targeted forward search identified additional studies published or accepted between January 2024 and March 2026.

2. Methods

This narrative review follows a thematic synthesis approach (Popay et al., 2006). Sources were identified through the reference lists of three anchor syntheses—Cepeda et al. (2006, 2008), Dunlosky et al. (2013a), and Donoghue and Hattie (2021)—and extended through forward citation of scheduling-algorithm literature (Wozniak, 2018; Ye et al., 2024) and a supplementary search of PubMed, Web of Science, and Cambridge Core for publications from 2024 onward using terms including spaced repetition, distributed practice, spacing effect, and FSRS. Inclusion criteria prioritized meta-analyses, controlled experiments, and neuroimaging studies with direct relevance to spacing intervals, retrieval practice, or SRS implementation. Industry reports, forum discussions, and algorithm documentation were included only for sections on commercial implementation where peer-reviewed evidence is limited. No formal quality scoring was applied; effect sizes are reported as published by source authors.

3. Results

3.1 Neural and Molecular Mechanisms

Spacing effects are observable at multiple levels of biological organization. At the molecular level, cAMP response element-binding protein (CREB) functions as a transcription factor that determines whether synaptic changes consolidate into long-term memory (Santoro, 2021). In Drosophila, ten odor–shock pairings produce approximately three days of avoidance when massed, but seven or more days when separated by fifteen-minute intervals—a substantial duration relative to the fly's fifty-day lifespan. Genetically overexpressing CREB causes massed training to produce long-term memory, indicating that CREB activation is a rate-limiting step that spacing accommodates (Naqib et al., 2012; Philips et al., 2013).

Mitogen-activated protein kinase (MAPK) provides a complementary timing mechanism: activation peaks approximately forty-five minutes post-training, defining a window in which subsequent trials reinforce prior learning. Four spaced three-minute stimulations with ten-minute rest intervals produce persistent MAPK activation; a single twelve-minute pulse does not (Naqib et al., 2012). Similar constraints appear in sea slugs and rodents, suggesting evolutionary conservation (San Martin et al., 2017).

At the cognitive level, spaced training increases opportunities for retrieval practice—both explicit and implicit—and strengthens the neural pathways supporting memory (Santoro, 2021). Feng et al. (2019) demonstrated via fMRI that spaced learning enhances episodic memory by increasing neural pattern similarity across repetitions, consistent with reactivation of a stable memory trace rather than encoding of independent episodes.

At the systems level, the hippocampal–cortical transfer model posits that the hippocampus rapidly encodes new memories, which are gradually redistributed to neocortical networks during sleep via sharp-wave ripples (Dudai et al., 2015). A recent fMRI study involving 48 participants compared three-day spaced learning with one-day massed learning (Yang et al., 2025). Immediate performance did not differ between groups, but spaced learners showed superior retention at one week and one month. Neural pattern similarity in default mode network (DMN) subsystems—dorsal-medial and medial-temporal—during immediate retrieval predicted one-month retention. Spaced learning additionally increased neural replay of durable memories in dorsal-medial DMN during rest, whereas massed learning showed replay confined to the hippocampus. These findings suggest that day-scale spacing facilitates cortical consolidation mechanisms not captured by massed encoding.

Josselyn (2021, as cited in Santoro, 2021) cautions that integration across levels of analysis—from gene expression to cognition—remains incomplete; nevertheless, spacing requirements appear consistently across molecular, cellular, and systems measures.

Complementing hippocampal–DMN findings, Zou et al. (2025) reported that spaced learning increased representational similarity in ventromedial prefrontal cortex (vmPFC), and that these neural similarity increases paralleled behavioral spacing benefits. Critically, spacing effects depended on successful retrieval and subsequent re-encoding of prior encounters—spacing benefits were substantially reduced when participants failed to retrieve earlier presentations. This is consistent with a re-encoding account: inter-session intervals are beneficial not merely because the brain rests, but because successful retrieval during a later encounter updates and strengthens the existing memory trace (Zou et al., 2025; Chan et al., 2025). Together with Yang et al.'s (2025) DMN replay data, recent neuroimaging studies increasingly support cortical consolidation and representational stabilization as mechanisms distinguishing spaced from massed learning at day-scale and longer intervals.

3.2 Behavioral Evidence and Effect Sizes

3.2.1 Meta-analytic findings

Cepeda et al. (2006) synthesized 839 assessments from 317 experiments. Spaced presentations surpassed massed presentations across retention intervals ranging from under one minute to over thirty days; only 4.4% of comparisons favored massing. Spacing yielded approximately 74% better retention than cramming. Cepeda et al. (2008) further demonstrated that optimal inter-study intervals scale with intended retention interval: for one-week retention, optimal gaps approximate 20–40% of the retention interval; for one-year retention, 5–10%. Performance rises with interval length to an optimum, then declines modestly, while absolute performance decreases as retention intervals lengthen.
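The proportional rule reported by Cepeda et al. (2008) can be turned into concrete gaps with simple arithmetic. The sketch below is illustrative only: it encodes just the two retention intervals quoted above and does not interpolate between them, and the names `OPTIMAL_GAP_FRACTION` and `optimal_gap_days` are hypothetical rather than drawn from any cited source.

```python
# Gap fractions quoted in the text (Cepeda et al., 2008). Only these two
# retention intervals are covered; nothing in between is interpolated.
OPTIMAL_GAP_FRACTION = {
    7: (0.20, 0.40),    # one-week retention interval
    365: (0.05, 0.10),  # one-year retention interval
}


def optimal_gap_days(retention_interval_days: int) -> tuple[float, float]:
    """Return the (low, high) inter-study gap in days for a quoted interval."""
    low, high = OPTIMAL_GAP_FRACTION[retention_interval_days]
    return (low * retention_interval_days, high * retention_interval_days)


for days in OPTIMAL_GAP_FRACTION:
    low, high = optimal_gap_days(days)
    print(f"retention interval {days} d: gap of about {low:.2f} to {high:.2f} d")
```

Under these figures, a one-week target implies a gap of roughly one and a half to three days, whereas a one-year target implies a gap of roughly eighteen to thirty-seven days: the optimal gap grows in absolute terms even as its share of the retention interval shrinks.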

Donoghue and Hattie (2021) quantified Dunlosky et al.'s (2013a) ten learning techniques across 242 studies, 1,619 effects, and 169,179 participants. Distributed practice (d = 0.85) and practice testing (d = 0.74) ranked highest; summarization (d = 0.44) and underlining (d = 0.44) ranked lowest among techniques studied (see Table 1).

Table 1. Effect sizes for selected learning techniques (Donoghue & Hattie, 2021)

Technique	Cohen's d	Dunlosky et al. (2013a) classification
Distributed practice	0.85	High utility
Practice testing	0.74	High utility
Elaborative interrogation	0.56	Moderate utility
Summarization	0.44	Low utility

Important moderators include ability level (lower-ability students: d = 0.47; higher-ability: d = −0.11), transfer distance (near: d = 0.61; far: d = 0.39), and outcome depth (surface: d = 0.60; deep: d = 0.26). A methodological caveat applies: 93% of included studies measured surface learning, and 74% tested within one day, limiting generalization to conceptual understanding and long-delay retention.

3.2.2 Domain-specific and applied meta-analyses (2024–2026)

Recent meta-analyses suggest that spacing effect sizes vary meaningfully by domain and ecological context (see Table 2).

Table 2. Selected recent meta-analytic effect sizes for spaced versus massed practice

Domain / context	Effect size	Sample	Source
General learning techniques	d = 0.85	169,179 learners	Donoghue & Hattie (2021)
Medical education (objective tests)	SMD = 0.78	21,415 learners	Maye & Hurley (2026)
Classroom-applied research	d = 0.54	22 reports (31 effect sizes)	Mawson & Kang (2025)
Mathematics (spacing)	g = 0.28	27 studies	Murray et al. (2025)
Mathematics (isolated material)	g = 0.43	10 studies	Murray et al. (2025)
Mathematics (course-embedded)	g = 0.24	17 studies	Murray et al. (2025)

Maye and Hurley (2026) conducted a PRISMA-compliant systematic review of spaced repetition in medical education, screening 542 records and synthesizing thirteen studies. Interventions included faculty-built or third-party flashcard decks (including Anki), email-delivered MCQs, continuing-medical-education frameworks, and spaced classroom quizzes. The pooled standardized mean difference of 0.78 (95% CI 0.56–0.99) ranks among the largest domain-specific spacing effects reported, suggesting that SRS implementations in high-stakes professional education can produce substantial objective-test gains when adherence is structurally supported.

Murray et al. (2025) meta-analyzed twenty-seven spacing comparisons and seven testing-versus-restudy comparisons in mathematics. Spacing produced a robust small-to-medium overall effect (g = 0.28), with larger effects for isolated material (g = 0.43) than course-embedded learning (g = 0.24). The testing effect in mathematics was weaker and statistically uncertain (g = 0.18; 95% CI crossing zero), suggesting that retrieval practice may not generalize uniformly across domains—a finding with implications for SRS content beyond paired-associate vocabulary.

Mawson and Kang (2025) reviewed applied classroom research specifically, screening over 3,000 records and retaining twenty-two reports with thirty-one effect sizes. Distributed practice outperformed massed practice with d = 0.54 (95% CI 0.31–0.77), with larger effects associated with longer retention intervals, higher education levels, and fewer re-exposures. This review partially addresses the ecological-validity gap noted in Donoghue and Hattie (2021), confirming that spacing benefits persist outside decontextualized laboratory paradigms—though effect sizes remain smaller than in tightly controlled verbal-learning experiments.
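Because the syntheses above report a mix of d, g, and SMD, it may help to see how these standardized mean differences relate. The following is a minimal sketch using the standard pooled-standard-deviation definition of Cohen's d and the usual small-sample correction that yields Hedges' g; the input numbers are hypothetical and are not taken from any study cited here.

```python
import math


def cohens_d(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)


def hedges_g(d, n_a, n_b):
    """Shrink Cohen's d by the usual small-sample correction factor."""
    correction = 1 - 3 / (4 * (n_a + n_b - 2) - 1)
    return d * correction


# Hypothetical data: spaced group mean 75, massed group mean 70,
# standard deviation 10 in both groups, 20 learners per group.
d = cohens_d(75, 70, 10, 10, 20, 20)
g = hedges_g(d, 20, 20)
print(f"d = {d:.3f}, g = {g:.3f}")  # prints "d = 0.500, g = 0.490"
```

The correction matters mainly for small samples; with thousands of learners, d and g are nearly identical, so differences among the values in Table 2 are more plausibly attributable to domain, design, and outcome measures than to the choice of metric.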

3.2.3 Retrieval practice and combined effects

Roediger and Karpicke (2006) showed that students who completed recall tests after studying prose passages outperformed restudying controls at one-week delay, despite lower self-rated confidence—a judgments-of-learning paradox wherein learners prefer strategies that feel productive over those that are effective (Koriat, 1997). Rowland (2014) reported a meta-analytic effect of d = 0.50 for testing over restudy, with larger effects for recall than recognition tasks.

Combining spacing and retrieval practice yields compounding benefits. Price et al. (2025) randomized 26,258 family physicians and residents across five spaced-repetition conditions using the American Board of Family Medicine Continuous Knowledge Self-Assessment. Spaced repetition outperformed no repetition for learning at six months (58.03% vs. 43.20%; Cohen's d = 0.62) and knowledge transfer at ten months (58.33% vs. 52.39%; d = 0.26). Double-spaced repetitions exceeded single-spaced repetitions for both learning (d = 0.43) and transfer (d = 0.20), though specific repetition-strategy variants within single- and double-spaced groups did not differ meaningfully—paralleling Kang et al.'s (2014) finding that schedule type matters less than spacing itself.

Kang et al. (2014) found that both equal-interval and expanding-interval schedules outperform massed practice for long-term retention, with no consistent advantage of expanding over fixed intervals—a result with direct implications for scheduler design.

Classroom replications confirm ecological validity. Kapler et al. (2015) found that spacing a review quiz eight days after a lecture—versus one day—improved five-week test performance in a simulated undergraduate setting, for both factual and higher-order items. Rogers et al. (2025) conceptually replicated Cepeda et al. (2008) in an online L2 vocabulary study, finding that all spaced intersession intervals (1–14 days) outperformed massed practice on a ten-day delayed posttest, with evidence of spacing but not lag effects—consistent with Kim and Webb (2022). However, online participants scored 10–20% lower than laboratory samples, and attrition reached 33%, highlighting methodological challenges for multisession online SRS research (Rogers et al., 2025; Rodd, 2024).

3.3 Scheduling Algorithms: From SM-2 to FSRS

3.3.1 Historical development

Computational spaced repetition originates in Piotr Wozniak's 1985 measurement of personal forgetting rates, culminating in SuperMemo Algorithm SM-0 with intervals of 1, 2, 4, 8, 16, and 32 days (Wozniak, 2018). Algorithm SM-2 (1987), which adjusts intervals via a per-item easiness factor updated from the learner's recall grades, remains the basis of the default schedulers in Anki and Mnemosyne. Wozniak's two-component memory model (1988), which distinguishes retrievability (current recall probability) from stability (memory durability), anticipated formalizations in modern schedulers. Subsequent SuperMemo iterations (SM-5 through SM-17) introduced universal memory formulas, exponential forgetting curves, and refined stability-increase functions (Wozniak, 2018).
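For readers unfamiliar with SM-2, its update rule can be sketched in a few lines. The version below follows the commonly circulated public description of the algorithm (grades 0 to 5, first intervals of one and six days, easiness factor floored at 1.3) and should be read as an approximation: the class and function names are hypothetical, and the variants shipped in Anki and Mnemosyne differ in details such as grade scales and lapse handling.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CardState:
    repetitions: int = 0    # consecutive successful reviews so far
    interval_days: int = 0  # days until the next review
    easiness: float = 2.5   # easiness factor (EF); never drops below 1.3


def sm2_review(state: CardState, grade: int) -> CardState:
    """Return the card state after a review graded 0 (blackout) to 5 (perfect)."""
    if not 0 <= grade <= 5:
        raise ValueError("grade must be an integer from 0 to 5")

    if grade < 3:
        # Failed recall: restart the interval sequence, leave EF unchanged.
        return CardState(repetitions=0, interval_days=1, easiness=state.easiness)

    # Successful recall: lower grades reduce EF, a grade of 5 raises it.
    miss = 5 - grade
    easiness = max(1.3, state.easiness + 0.1 - miss * (0.08 + miss * 0.02))

    if state.repetitions == 0:
        interval = 1
    elif state.repetitions == 1:
        interval = 6
    else:
        # Later intervals grow multiplicatively, rounded up to whole days.
        interval = math.ceil(state.interval_days * easiness)

    return CardState(state.repetitions + 1, interval, easiness)


card = CardState()
for grade in (5, 4, 3, 5):
    card = sm2_review(card, grade)
    print(grade, card.interval_days, round(card.easiness, 2))
```

The sketch makes the heuristic character of SM-2 visible: a single multiplier per item stands in for memory dynamics, with no explicit estimate of recall probability.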

3.3.2 FSRS architecture

The Free Spaced Repetition Scheduler (FSRS), developed by the open-spaced-repetition community, models three latent variables (Ye et al., 2024):

Retrievability (R): recall probability at time t, fitted by a power-law forgetting curve (FSRS v4 onward), which empirically outperforms exponential decay on large-scale review logs.
Stability (S): the interval at which R decays to 90%; when elapsed time equals S, R = 0.90 by definition.
Difficulty (D): a heuristic index (1–10) updated after each review based on response grade.

FSRS-6 optimizes twenty-one parameters via gradient descent on individual review histories, minimizing log-loss between predicted and observed recall. Benchmarks on 727 million reviews from approximately 10,000 Anki users report log-loss of 0.3460 for FSRS-6, 0.416 for SM-2, and 0.4694 for Duolingo's Half-Life Regression algorithm (Open Spaced Repetition, 2024; Expertium, 2025; Yudame Research, 2025). An independent benchmark across 9,999 Anki collections (~350 million reviews) reports FSRS-6 with recency weighting achieves lower log-loss than SM-2 for 99.6% of users (Expertium, 2025). FSRS-6 introduced a user-specific forgetting-curve shape parameter (w₂₀) and same-day review handling; practitioners report 20–30% fewer reviews needed to maintain equivalent retention after migrating from SM-2 (Open Spaced Repetition, 2025). Ye et al. (2024) published the underlying stochastic shortest-path formulation in ACM KDD and IEEE TKDE, framing spacing as an optimization problem over memory dynamics rather than a fixed heuristic.
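The benchmark metric referred to above, log-loss, is the mean binary cross-entropy between a scheduler's predicted recall probability and the observed outcome of each review. The sketch below shows the computation on hypothetical data; it is not the benchmark code itself, and the function name and clipping constant are assumptions made for illustration.

```python
import math


def log_loss(predicted, observed, eps=1e-15):
    """Mean binary cross-entropy between predicted recall and 0/1 outcomes."""
    if not predicted or len(predicted) != len(observed):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, y in zip(predicted, observed):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(predicted)


# Hypothetical reviews: three successful recalls and one lapse.
outcomes = [1, 1, 1, 0]
calibrated = log_loss([0.75, 0.75, 0.75, 0.75], outcomes)
overconfident = log_loss([0.99, 0.99, 0.99, 0.99], outcomes)
print(f"calibrated: {calibrated:.3f}, overconfident: {overconfident:.3f}")
```

Lower values indicate better-calibrated predictions; an overconfident model is penalized heavily for each unexpected lapse. The metric evaluates prediction only and says nothing about how much a learner retains.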

The model encodes three principles with pedagogical significance (Expertium, 2024): (a) stability gains are maximal when retrieval succeeds at low R (the "desirable difficulty" window); (b) stability increases saturate with successive reviews and vary inversely with item difficulty; and (c) desired retention rate trades off against review workload—higher target retention yields shorter intervals.
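Principle (c) follows directly from the shape of the forgetting curve. The sketch below uses a power-law curve scaled so that R = 0.90 when elapsed time equals S, as the definition of stability requires. The decay value of -0.5 is an illustrative assumption, not a fitted parameter (FSRS-6 fits the curve shape per user), and the function names are hypothetical.

```python
def curve_factor(decay: float) -> float:
    """Scale factor that forces R = 0.9 when elapsed time equals stability."""
    return 0.9 ** (1.0 / decay) - 1.0


def retrievability(elapsed_days: float, stability: float, decay: float = -0.5) -> float:
    """Power-law recall probability after `elapsed_days` (decay must be negative)."""
    if stability <= 0 or decay >= 0:
        raise ValueError("stability must be positive and decay negative")
    return (1.0 + curve_factor(decay) * elapsed_days / stability) ** decay


def interval_for_retention(stability: float, desired_retention: float,
                           decay: float = -0.5) -> float:
    """Days until predicted recall falls to `desired_retention`."""
    if not 0.0 < desired_retention < 1.0:
        raise ValueError("desired_retention must lie strictly between 0 and 1")
    return stability / curve_factor(decay) * (desired_retention ** (1.0 / decay) - 1.0)


stability = 10.0  # hypothetical item with a ten-day stability
for target in (0.80, 0.90, 0.95):
    days = interval_for_retention(stability, target)
    print(f"target retention {target:.2f}: review after about {days:.1f} days")
```

With a target of 0.90 the interval equals the stability by construction; raising the target shortens the interval and therefore increases the number of reviews, while lowering it lengthens the interval at the cost of more lapses.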

3.3.3 Prediction versus learning outcomes

A critical distinction separates prediction accuracy from learning efficacy. FSRS predicts recall probability more accurately than SM-2 (mean absolute percentage error approximately 12–33% depending on prediction target), yet no rigorous head-to-head trials demonstrate that algorithmic sophistication produces meaningfully superior retention over months or years in ecologically valid settings. Expanding intervals yield only approximately 3% better outcomes than fixed intervals in meta-analytic comparisons, and Kang et al. (2014) found no meaningful difference between schedule types for long-term retention. The evidence supports the conclusion that any reasonable spaced scheduler substantially outperforms massed practice, while marginal returns from algorithmic refinement remain unproven. Adoption barriers appear to center on workflow friction, onboarding, and content design rather than model capability (Dunlosky et al., 2013b).

3.4 The Recognition–Production Gap in Language Learning

Kim and Webb (2022) meta-analyzed forty-eight experiments (N = 3,411) on spaced vocabulary practice, reporting large effect sizes (g = 1.04 with immediate feedback; g = 0.64–2.34 with delayed feedback). However, the majority of studies employed paired-associate learning—the standard flashcard format—and assessed outcomes in formats isomorphic to training. This limits inference about transfer to productive language use.

Recognition and production appear to be partially dissociable constructs. González-Fernández (2025), studying 314 EFL learners from Chinese and Spanish L1 backgrounds, used implicational and Mokken scaling to establish a reliable hierarchy in which recognition knowledge precedes recall knowledge across form–meaning links, collocations, multiple meanings, and derivatives—a sequence stable across L1 groups and accuracy thresholds. This finding extends González-Fernández and Schmitt's (2020) earlier work and provides empirical grounding for the recognition-before-production developmental sequence hypothesized in practitioner literature. Stewart et al. (2024) argue that lexical recall and recognition may constitute distinct psychometric constructs. Vocabulary size explains substantial variance in speaking proficiency (32–84% depending on conditions), yet large vocabularies do not guarantee lexically sophisticated production in speech.

Several theoretical frameworks explain this gap:

Proceduralization failure (DeKeyser, 2015): declarative knowledge built through flashcard review must undergo extensive production practice to become procedurally automatic; recognition review engages controlled processing incompatible with real-time conversational demands.
Transfer-appropriate processing (Morris et al., 1977): memory is strongest when encoding and retrieval processes match; flashcard recognition engages different processes than conversational production.
Context-dependent memory (Godden & Baddeley, 1975): material learned in one environmental context is recalled better in that context; interface-specific learning may fail to activate in conversational settings.
Absence of communicative pressure: SRS does not simulate the time constraints and cognitive load of simultaneous comprehension and production in dialogue.

Donoghue and Hattie's (2021) domain-specific effect size for distributed practice in languages (d = 0.39; cf. mathematics d = 1.16) is consistent with moderate effects on vocabulary recall and weaker evidence for generative language tasks. Application metrics (cards reviewed, retention percentages, streak counts) measure scheduler compliance rather than communicative competence, potentially creating an illusion of progress.

Table 3. Recognition versus production practice modes (synthesis after Yudame Research, 2025; González-Fernández, 2025; Stewart et al., 2024)

Mode	Low time pressure	High time pressure
Recognition (input)	Flashcard review at self-paced rates—where most SRS time is spent; builds declarative knowledge necessary but insufficient for fluency	Listening comprehension under time pressure; builds processing speed
Production (output)	Writing, sentence construction, journaling—bridges recognition to production without conversational load	Live conversation—requires automatized retrieval, pragmatic competence, and error tolerance

Most SRS implementations concentrate practice in the top-left cell; communicative fluency requires the bottom-right. The diagonal from passive recognition to active production under pressure is the path many self-directed learners fail to complete (Yudame Research, 2025).

Recent L2 research adds nuance to the flashcard-transfer debate. Nakata and Elgort (2021) found that spacing during contextual vocabulary learning from reading improved explicit knowledge (meaning recall and form–meaning matching) but not tacit semantic knowledge measured by priming—suggesting that spaced contextual exposure and spaced paired-associate drill may build different knowledge types. Serrano and Pellicer-Sánchez (2026), in an eye-tracking study of ninety-two bilingual learners, found that spaced and contextually varied reading conditions increased processing difficulty (more and longer fixations) without improving delayed vocabulary recall or recognition relative to massed reading of the same material. Immediate gains were actually higher under massed conditions, though they declined more sharply—illustrating that spacing during incidental reading is not universally beneficial when increased difficulty fails to produce successful retrieval (Bjork & Bjork, 2011). These findings caution against assuming that any spaced exposure—particularly without retrieval success—automatically produces durable learning.

3.5 Practitioner Integration and Card Design

Expert language learners converge on SRS as a supplementary tool rather than a primary acquisition method, typically allocating 10–30% of study time to spaced review (Refold, n.d.; practitioner reports cited in Kim & Webb, 2022; Yudame Research, 2025). Polyglot practitioners articulate divergent but overlapping prescriptions: Steve Kaufmann (LingQ) treats SRS as optional relative to massive comprehensible input; Luca Lampariello reports SRS use only for narrow needs, preferring contextual re-exposure; Gabriel Wyner (Fluent Forever) centers SRS but pairs it with pronunciation-first sequencing and multimodal cards that avoid translation dependence (Yudame Research, 2025). Shared principles include daily consistency over session length, personally mined cards over premade decks, and treating SRS as a vocabulary floor rather than a complete acquisition system.

Refold-derived heuristics suggest allocating roughly 30–40% of study time to SRS for beginners, 20–30% for intermediates, and 10–15% for advanced learners. These ratios are practitioner-derived rather than trial-validated, but they align with the input-plus-SRS model (Refold, n.d.; Yudame Research, 2025). A twelve-week integration protocol distilled from practitioner convergence recommends: weeks 1–3, SRS-heavy foundation (10–15 new cards/day, pronunciation and high-frequency vocabulary); weeks 4–6, ramp comprehensible input and mine one-target (1T) sentences, in which only a single element is unknown, from authentic content; weeks 7–9, moderate SRS to ~10 new cards/day and shift freed time to input and low-pressure output (shadowing, writing); weeks 10–12, integrated phase (~15–20% SRS, sessions capped at fifteen minutes, emphasis on conversation) (Yudame Research, 2025).

Controlled research on optimal time allocation remains sparse; however, a meta-analysis of twenty-one extensive reading studies (N = 1,268) reported d = 1.32 for vocabulary gains from reading alone (Liu & Zhang, 2018)—comparable to SRS effect sizes.

Chan et al. (2025) tested competing accounts of spacing mechanisms. Students learning calculus differentiation rules in spaced versus massed sessions showed spacing benefits without measurable working memory depletion, favoring unconscious mental rehearsal over rest-and-recovery explanations. Hendrick (2025) reframes spacing as a scheduling decision rather than a study strategy, implying that activities during inter-session intervals may modulate spacing effectiveness, a point relevant to both classroom timetabling and application design. Zou et al. (2025) provide complementary neural evidence that spacing benefits require retrieval-mediated re-encoding, not passive consolidation alone.

AbdAlgane et al. (2025) compared equal-interval and expanded-interval mobile spaced learning among sixty ESP students learning scientific vocabulary. Both schedules improved vocabulary mastery, but only expanded spacing significantly improved broader academic performance—suggesting that schedule shape may matter for transfer beyond paired-associate recall, at least in cognitively demanding domain-specific curricula. Serfaty and Serrano (2024) found that additional relearning sessions spaced one day apart improved L2 grammar retention in an online crowdsourced study, though attrition remained a concern.

Evidence-backed card-design strategies partially address decontextualization:

Sentence-level versus word-level cards: sentence contexts support grammar and collocation learning and may invite elaborative processing (elaborative interrogation: d = 0.56; Donoghue & Hattie, 2021); cloze formats offer higher throughput.
Sentence mining from authentic input: the "one-target" (1T) principle—creating cards only from sentences where a single element is unknown—maintains comprehensibility while preserving contextual encoding (Krashen, 1985).
Dual coding (Paivio, 1986): concurrent verbal and visual encoding strengthens retention; self-generated mnemonics outperform provided ones.
Context-highlighted ("anime") cards: a target lexical item within a sentence context, often with audio, can sustain contextual benefits at two to four times the review throughput of full sentence cards in community practice (Yudame Research, 2025).
AI-assisted generation: large language models can accelerate card creation but require integration with retrieval practice and feedback to avoid passive consumption (Dunlosky et al., 2013a). Forum surveys suggest a majority of medical students would adopt ChatGPT-generated Anki cards if reliable workflows existed—the barrier appears to be packaging and quality control rather than model capability (Yudame Research, 2025).

A quality–quantity trade-off persists: richer card design improves transfer potential but increases creation time, potentially reducing total practice volume.

3.6 Adoption Barriers in Education and Consumer Applications

3.6.1 Institutional failure

Dempster (1988) documented that despite the spacing effect being among the most dependable findings in experimental psychology, American classrooms and textbooks failed to implement systematic distributed review—a gap he attributed to curriculum structure, assessment timing, and institutional inertia. Soviet mathematics textbooks of the era provided more distributed practice than American equivalents despite equivalent access to psychological literature. Dunlosky et al. (2013b) noted that teacher-education textbooks rarely cover evidence-based study techniques, and curricula prioritize content over learning methodology.

Metacognitive misalignment compounds structural barriers: in controlled studies, 83% of participants rated massed practice as equally or more effective than spaced practice despite superior delayed retention for spacing (Kornell & Bjork, 2008). Students prefer cramming because it produces stronger immediate performance; spacing advantages manifest only after delay (Santoro, 2021). Simone (2021, as cited in Santoro, 2021) reported that even students aware of spacing benefits rarely employ them, because distributed practice "is a slower, harder way to remember information."

Dunlosky et al. (2013b) recommend low-stakes opening quizzes, cumulative examinations, study planners, retrieval practice over re-reading, interleaved problem sets, and explicit instruction on technique efficacy. Lindsey et al. (2014) argued that optimal spacing schedules exceed what teachers or students can manually coordinate without technological support.

3.6.2 Consumer application persistence

SRS applications face a distinct failure mode: review backlog growth. Expanding-interval schedulers accumulate future obligations with each new item; at approximately twenty new cards per day under SM-2-style doubling, users may face 200–400 daily reviews within three months. Skipping sessions produces compounding backlogs that discourage re-engagement: illustrative community estimates place overdue reviews at roughly fifty after one missed day, 120 after two, 190 after three, and 280 after four for typical new-card loads (Yudame Research, 2025). The metacognitive illusion that effortful retrieval signals failure amplifies dropout when users return to large piles (Kornell & Bjork, 2008). Wozniak's (2018) early paper-based system—seventy-nine pages covering 2,794 word pairs—illustrates that review-burden scaling predates digital tools.

Successful long-term SRS use appears to require calibration rarely achieved without guidance: ten to twenty new items daily, completion of due reviews before adding new material, session limits of fifteen to thirty minutes, and sustained practice over approximately three months before benefits become salient (community guidelines; see Anki Manual, n.d.).

3.6.3 Engagement–learning tension#

Commercial language applications optimize for daily active users, session length, and conversion metrics that may conflict with optimal spacing. Independent research on Duolingo reports modest learning gains alongside issues of learner persistence, motivation, and program efficacy (Loewen et al., 2019; Vesselinov & Grego, 2012). A separate systematic review characterized the product's design as emphasizing competition over collaboration, repetition and translation over contextual feedback, and receptive skills over productive output—arguing that gamification novelty may not offset these structural limits once habituation occurs (Yudame Research, 2025). Duolingo's reported metrics—500+ million registered users, 103.6 million monthly active users, approximately 2% paid conversion, 7-day streak users 3.6× more likely to remain engaged—illustrate incentives to optimize engagement over delayed learning outcomes (Yudame Research, 2025). Monetization via mistake-limited "hearts" and ad-supported continuation further aligns session length with revenue rather than with Cepeda-style optimal gaps (Yudame Research, 2025).

Memrise's 2024–2025 product pivot (immersive personalization on the main app, with legacy community courses relocated) exemplifies platform risk for learners who invested in user-generated content; migration threads on Anki forums document demand for portable decks when platforms reorient (Yudame Research, 2025). The Anki ecosystem inverts the engagement-first model: user-owned data, maximal customization, and community-driven algorithm development (e.g., FSRS integration; Open Spaced Repetition, 2025), at the cost of steep onboarding and absent quality control.

Neither model has resolved low category-wide retention (approximately 2% at Day 30 for education apps; Business of Apps, 2026). Donoghue and Hattie's (2021) finding that lower-ability learners benefit most from spacing and testing suggests underserved populations could gain substantially from well-designed tools—provided product metrics align with learning outcomes.

3.7 Emerging directions (2024–2026)#

Three trends warrant monitoring. First, domain-specific meta-analyses are proliferating (medicine, mathematics, classroom-applied settings), revealing that spacing effect sizes are not uniform: medical and verbal domains show larger effects than mathematics, where procedural fluency and problem-solving transfer may require different retrieval formats (Murray et al., 2025; Maye & Hurley, 2026). Second, neuroimaging at scale is shifting mechanistic debate from rest-versus-rehearsal toward re-encoding and cortical representational stabilization (Zou et al., 2025; Yang et al., 2025). Third, AI-assisted card generation and adaptive scheduling are entering mainstream SRS workflows: add-ons such as AnkiAIUtils and GPT-integrated templates can generate explanations, mnemonics, and media from textbooks or PDFs, though long-term effectiveness data remain absent and community guidance stresses manual verification (Yudame Research, 2025).

Martinengo et al. (2024) reported benefits of spaced digital education for health professionals in a prior systematic review, and French et al. (2024) formally called for spaced-repetition integration into medical curricula, contexts where Maye and Hurley (2026) now provide meta-analytic confirmation. Guided platforms are experimenting with AI video calls and adventure-style interaction (Duolingo, 2024–2025) to address the recognition–production gap; whether these features improve delayed transfer or primarily session metrics is not yet established (Yudame Research, 2025). The open question is whether AI-generated content preserves the retrieval quality and contextual richness that recent L2 research identifies as necessary for transfer (Nakata & Elgort, 2021; Serrano & Pellicer-Sánchez, 2026).

4. Discussion#

4.1 Synthesis#

This review identifies a fundamental mismatch between spacing-effect efficacy under controlled conditions and outcomes in naturalistic learning environments. The spacing effect is robust at molecular, neural, and behavioral levels; distributed practice and retrieval testing rank highest among learner-controlled techniques; and modern schedulers predict forgetting with increasing precision. Nevertheless, real-world failure modes cluster around five interrelated factors:

Transfer limits: SRS primarily builds recognition knowledge with weak far transfer to productive skills.
Context dependence: interface-bound learning may not activate in dissimilar retrieval contexts.
Metacognitive misalignment: learners and institutions prefer massed strategies that optimize immediate performance.
Infrastructure deficits: manual spacing exceeds practical coordination capacity without technological support, yet existing tools impose unsustainable review loads or misaligned engagement incentives.
Measurement mismatch: application metrics track scheduler compliance, not competence.

4.2 Theoretical implications#

Recent evidence favoring mental rehearsal over working-memory recovery during inter-session intervals (Chan et al., 2025) and re-encoding-dependent vmPFC representational stabilization (Zou et al., 2025) suggests that spacing schedule design should account for activities between sessions and the success of retrieval during re-exposure, not only interval duration. Prior knowledge moderates spacing effectiveness, supporting adaptive schedules that shorten initial gaps for novices and extend them as schemas develop. The dissociation between scheduler prediction accuracy and learning outcomes implies that research should distinguish forecasting models from intervention models—a distinction largely absent from current algorithm benchmarking.

Domain-specific meta-analyses further suggest that spacing recommendations should be calibrated to content type. Medical-education spacing effects (SMD = 0.78; Maye & Hurley, 2026) exceed mathematics spacing effects (g = 0.28; Murray et al., 2025), which in turn may depend on whether material is isolated or embedded in coursework. Testing effects appear robust in verbal domains but uncertain in mathematics (Murray et al., 2025), implying that SRS implementations for quantitative or procedural content may require format-specific retrieval tasks rather than generic flashcard recognition.
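
One way to make these magnitudes concrete is Cohen's U3, the share of the comparison group that falls below the average learner in the spaced condition. The sketch below assumes normally distributed outcomes with equal variances and treats SMD, d, and g as interchangeable standardized differences, which is an approximation; the d = 0.54 classroom figure is the Mawson and Kang (2025) estimate reported elsewhere in this review.

```python
from statistics import NormalDist


def cohens_u3(effect_size):
    """Share of the comparison distribution below the spaced-group mean.

    Assumption: both groups are normal with equal variance, so U3 is the
    standard normal CDF evaluated at the standardized mean difference.
    """
    return NormalDist().cdf(effect_size)


reported = {
    "medical education (SMD = 0.78)": 0.78,
    "classroom settings (d = 0.54)": 0.54,
    "mathematics (g = 0.28)": 0.28,
}
for label, effect_size in reported.items():
    print(f"{label}: {cohens_u3(effect_size):.0%}")
```

Read this way, the average spaced learner would sit at roughly the 78th percentile of a massed-practice group in medical education, the 71st in classroom settings, and the 61st in mathematics, provided the distributional assumptions hold.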

4.3 Implications for learning technology design#

Evidence supports several design priorities for spaced-repetition systems: (a) default distributed scheduling rather than optional spacing; (b) sustainable review-load management with transparent onboarding projections; (c) rich contextual card content (sentences, media, authentic sources); (d) integration with production practice rather than replacement of it; and (e) feedback-accompanied retrieval practice. Algorithmic sophistication yields diminishing returns relative to adherence, card quality, and session sustainability.

4.4 Limitations of this review#

This synthesis is narrative rather than systematic; no protocol was preregistered, and grey literature was included for the commercial implementation sections. Many cited language-learning effect sizes derive from paired-associate paradigms with limited ecological validity. Industry retention statistics depend on proprietary reporting. Online multisession studies exhibit substantial attrition (Rogers et al., 2025), limiting generalization to self-directed SRS use. Evidence linking product design to learning outcomes is correlational, so causal claims about design effects remain tentative.

5. Conclusion#

The spacing effect represents one of the best-supported principles in learning science, with converging evidence from CREB-mediated consolidation, hippocampal–cortical and vmPFC re-encoding mechanisms, and meta-analytic effect sizes ranging from g = 0.28 in mathematics (Murray et al., 2025) through d = 0.54 in classroom settings (Mawson & Kang, 2025) to SMD = 0.78 in medical education (Maye & Hurley, 2026). Computational spaced repetition successfully operationalizes this principle at scale. However, the translation from laboratory efficacy to durable, transferable competence, particularly in second-language production, remains incomplete. Flashcard-based SRS builds a necessary but insufficient vocabulary foundation; authentic input, production under communicative pressure, and context-rich encoding are required complements, not alternatives to be optimized away.

The forty-year research–implementation gap in formal education and the high abandonment rates of consumer learning applications appear to share a common cause: systems designed for immediate performance, engagement metrics, or algorithmic elegance rather than for sustainable, transferable learning under real-world constraints. Closing this gap requires treating spacing as a scheduling infrastructure problem (supported by technology, calibrated to learner capacity, and embedded within broader acquisition ecosystems) rather than as a standalone study hack.

References#

AbdAlgane, M., Elkot, M. A., & Ali, R. (2025). Embracing mobile-based spaced learning to enhance English scientific vocabulary mastery and academic performance among ESP students. International Journal of Interactive Mobile Technologies, 19(12), 103–120. https://doi.org/10.3991/ijim.v19i12.54815

Anki Manual. (n.d.). Studying—limits and optimal retention. AnkiWeb. https://docs.ankiweb.net/

Bjork, R. A., & Bjork, E. L. (2011). Making things hard on yourself, but in a good way: Creating desirable difficulties to enhance learning. In M. A. Gernsbacher, R. W. Pew, L. M. Hough, & J. R. Pomerantz (Eds.), Psychology and the real world: Essays illustrating fundamental contributions to society (pp. 56–64). Worth Publishers.

Business of Apps. (2026). Education app benchmarks. https://www.businessofapps.com/data/education-app-benchmarks/

Cepeda, N. J., Coburn, N., Rohrer, D., Wixted, J. T., Mozer, M. C., & Pashler, H. (2009). Optimizing distributed practice: Theoretical analysis and practical implications. Experimental Psychology, 56(4), 236–246.

Cepeda, N. J., Pashler, H., Vul, E., Wixted, J. T., & Rohrer, D. (2006). Distributed practice in verbal recall tasks: A review and quantitative synthesis. Psychological Bulletin, 132(3), 354–380.

Cepeda, N. J., Vul, E., Rohrer, D., Wixted, J. T., & Pashler, H. (2008). Spacing effects in learning: A temporal ridgeline of optimal retention. Psychological Science, 19(11), 1095–1102.

Chan, K.-Y., Chen, O., & Paas, F. (2025). The mechanisms of the spacing effect: Mental rehearsal versus rest and recovery. Educational Psychology, 45. https://doi.org/10.1080/01443410.2025.2551144

DeKeyser, R. M. (2015). Skill acquisition theory. In B. VanPatten & J. Williams (Eds.), Theories in second language acquisition (2nd ed., pp. 94–112). Routledge.

Dempster, F. N. (1988). The spacing effect: A case study in the failure to apply the results of psychological research. American Psychologist, 43(8), 627–634.

Donoghue, G. M., & Hattie, J. A. C. (2021). A meta-analysis of ten learning techniques. Frontiers in Education, 6, 581216. https://doi.org/10.3389/feduc.2021.581216

Dudai, Y., Karni, A., & Born, J. (2015). The consolidation and transformation of memory. Neuron, 88(1), 20–32.

Dunlosky, J., Rawson, K. A., Marsh, E. J., Nathan, M. J., & Willingham, D. T. (2013a). Improving students' learning with effective learning techniques: Promising directions from cognitive and educational psychology. Psychological Science in the Public Interest, 14(1), 4–58.

Dunlosky, J., Rawson, K. A., Marsh, E. J., Nathan, M. J., & Willingham, D. T. (2013b). Strengthening the student toolbox: Study strategies to boost learning. American Educator, 37(3), 12–21.

Ebbinghaus, H. (1913). Memory: A contribution to experimental psychology (H. A. Ruger & C. E. Bussenius, Trans.). Teachers College, Columbia University. (Original work published 1885)

Expertium. (2024). A technical explanation of FSRS. Expertium's Blog. https://expertium.github.io/Algorithm.html

Expertium. (2025). Benchmark of spaced repetition algorithms. Expertium's Blog. https://expertium.github.io/Benchmark.html

Feng, K., Zhao, X., Liu, J., Cai, Y., Ye, Z., Chen, C., & Xue, G. (2019). Spaced learning enhances episodic memory by increasing neural pattern similarity across repetitions. Journal of Neuroscience, 39(27), 5351–5360.

French, B. N., Marxen, T. O., Akhnoukh, S., et al. (2024). A call for spaced repetition in medical education. The Clinical Teacher, 21(1), e13669. https://doi.org/10.1111/tct.13669

Godden, D. R., & Baddeley, A. D. (1975). Context-dependent memory in two natural environments. British Journal of Psychology, 66(3), 325–331.

González-Fernández, B. (2025). How is vocabulary learnt? An acquisitional sequence of L2 word knowledge. TESOL Quarterly, 59(2), 755–784.

González-Fernández, B., & Schmitt, N. (2020). Word knowledge: Exploring the relationships and order of acquisition of vocabulary knowledge components. Applied Linguistics, 41(4), 481–505.

Hendrick, C. (2025, September 4). What makes spaced practice so powerful? Substack. https://carlhendrick.substack.com/p/what-makes-spaced-practice-so-powerful

Kang, S. H. K., Lindsey, R. V., Mozer, M. C., & Pashler, H. (2014). Retrieval practice over the long term: Should spacing be expanding or equal-interval? Psychonomic Bulletin & Review, 21(6), 1544–1550.

Kapler, I. V., Weston, T., & Wiseheart, M. (2015). Spacing in a simulated undergraduate classroom: Long-term benefits for factual and higher-level learning. Learning and Instruction, 36, 38–45.

Kim, S. K., & Webb, S. (2022). The effects of spaced practice on second language learning: A meta-analysis. Language Learning, 72(1), 269–319. https://doi.org/10.1111/lang.12479

Koriat, A. (1997). Monitoring one's own knowledge during study: A cue-utilization approach to judgments of learning. Journal of Experimental Psychology: General, 126(4), 349–370.

Kornell, N., & Bjork, R. A. (2008). Learning concepts and categories: Is spacing the "enemy of induction"? Psychological Science, 19(6), 585–592.

Krashen, S. D. (1985). The input hypothesis: Issues and implications. Longman.

Lindsey, R. V., Shroyer, J. D., Pashler, H., & Mozer, M. C. (2014). Improving students' long-term knowledge retention through personalized review. Psychological Science, 25(3), 639–647.

Liu, J., & Zhang, J. (2018). The effects of extensive reading on English vocabulary learning: A meta-analysis. English Language Teaching, 11(6), 1–15. https://doi.org/10.5539/elt.v11n6p1

Loewen, S., Crowther, D., Isbell, D. R., Kim, K. M., Maloney, J., Miller, Z. F., & Rawal, H. (2019). Mobile-assisted language learning: A Duolingo case study. ReCALL, 31(3), 293–311. https://doi.org/10.1017/S0958344019000065

London Now. (2021, December 29). Why do so few people complete a Duolingo course? London Now. https://www.london-now.co.uk/news/19888617.people-complete-duolingo-course/

Martinengo, L., Ng, M. S. P., Ng, T. D. R., Ang, Y. I., Jabir, A. I., Kyaw, B. M., & Car, L. T. (2024). Spaced digital education for health professionals: Systematic review and meta-analysis. Journal of Medical Internet Research, 26, e57760.

Mawson, R. D., & Kang, S. H. K. (2025). The distributed practice effect on classroom learning: A meta-analytic review of applied research. Behavioral Sciences, 15(6), 771. https://doi.org/10.3390/bs15060771

Maye, J. A., & Hurley, F. (2026). The effectiveness of spaced repetition in medical education: A systematic review and meta-analysis. The Clinical Teacher, 23(2). https://doi.org/10.1111/tct.70353

Morris, C. D., Bransford, J. D., & Franks, J. J. (1977). Levels of processing versus transfer appropriate processing. Journal of Verbal Learning and Verbal Behavior, 16(5), 519–533.

Murray, E., Horner, A. J., & Göbel, S. M. (2025). A meta-analytic review of the effectiveness of spacing and retrieval practice for mathematics learning. Educational Psychology Review, 37, 75. https://doi.org/10.1007/s10648-025-10035-1

Nakata, T., & Elgort, I. (2021). Effects of spacing on contextual vocabulary learning: Spacing facilitates the acquisition of explicit, but not tacit, vocabulary knowledge. Language Learning, 71(4), 1238–1265.

Naqib, F., Sossin, W. S., & Farah, C. A. (2012). Molecular determinants of the spacing effect. Neural Plasticity, 2012, 581291.

Nuthall, G. (2007). The hidden lives of learners. NZCER Press.

Open Spaced Repetition. (2024). FSRS algorithm benchmarks. GitHub Wiki. https://github.com/open-spaced-repetition/fsrs4anki/wiki/The-Algorithm

Open Spaced Repetition. (2025). FSRS-6 release notes and Anki 25.07 integration. GitHub. https://github.com/open-spaced-repetition/fsrs4anki

Paivio, A. (1986). Mental representations: A dual coding approach. Oxford University Press.

Philips, G. T., Kopec, A. M., & Carew, T. J. (2013). Pattern and predictability in memory formation: From molecular mechanisms to clinical relevance. Neurobiology of Learning and Memory, 105, 117–124.

Popay, J., Roberts, H., Sowden, A., Petticrew, M., Arai, L., Rodgers, M., & Britten, N. (2006). Guidance on the conduct of narrative synthesis in systematic reviews. ESRC Methods Programme, 15(1), 047.

Price, D. W., Wang, T., O'Neill, T. R., Morgan, Z. J., Chodavarapu, P., Bazemore, A., Peterson, L. E., & Newton, W. P. (2025). The effect of spaced repetition on learning and knowledge transfer in a large cohort of practicing physicians. Academic Medicine, 100(1), 94–102. https://doi.org/10.1097/ACM.0000000000005856

Refold. (n.d.). Refold methodology overview. https://refold.la/

Rodd, J. M. (2024). Moving experimental psychology online: How to obtain high-quality data when we can't see our participants. Journal of Memory and Language, 134, 104472.

Roediger, H. L., III, & Karpicke, J. D. (2006). Test-enhanced learning: Taking memory tests improves long-term retention. Psychological Science, 17(3), 249–255.

Rogers, J., Nakata, T., & Chiu, M. M. (2025). Optimizing distributed practice online: A conceptual replication of Cepeda et al. (2009). Studies in Second Language Acquisition, 47(1), 417–439. https://doi.org/10.1017/S0272263124000706

Rowland, C. A. (2014). The effect of testing versus restudy on retention: A meta-analytic review of the testing effect. Psychological Bulletin, 140(6), 1432–1463.

San Martin, A., Rela, L., Gelb, B. D., & Pagani, M. R. (2017). The spacing effect for structural synaptic plasticity provides specificity and precision in plastic changes. Journal of Neuroscience, 37(19), 4992–5007.

Santoro, H. (2021, March 4). The neuroscience behind the spacing effect. BrainFacts. https://www.brainfacts.org/thinking-sensing-and-behaving/learning-and-memory/2021/the-neuroscience-behind-the-spacing-effect-030421

Serfaty, J., & Serrano, R. (2024). Practice makes perfect, but how much is necessary? The role of relearning in second language grammar acquisition. Language Learning, 74(1), 218–248.

Serrano, R., & Pellicer-Sánchez, A. (2026). The impact of practice conditions on vocabulary learning and processing: A closer look at difficulties arising from spacing and context variability. Applied Psycholinguistics. https://doi.org/10.1017/S0142716425100283

Stewart, J., Gyllstad, H., Nicklin, C., & McLean, S. (2024). Establishing meaning recall and meaning recognition vocabulary knowledge as distinct psychometric constructs in relation to reading proficiency. Language Testing, 41(1), 89–108. https://doi.org/10.1177/02655322231162853

Vesselinov, R., & Grego, J. (2012). The Duolingo efficacy study [White paper]. Duolingo. https://static.duolingo.com/s3/DuolingoReport_Final.pdf

Wozniak, P. (2018, June 1). The true history of spaced repetition. SuperMemo Blog. https://www.supermemo.com/en/blog/the-true-history-of-spaced-repetition

Yang, Y., Huang, Z., Yang, Y., Fan, M., & Yin, D. (2025). Time-dependent consolidation mechanisms of durable memory in spaced learning. Communications Biology, 8(1), 535. https://doi.org/10.1038/s42003-025-07964-6

Ye, J., & the open-spaced-repetition community. (2024). FSRS: Free Spaced Repetition Scheduler [Computer software and documentation]. GitHub. https://github.com/open-spaced-repetition/fsrs4anki

Yudame Research. (2025). Algorithms for Life, Ep. 1: The algorithm that remembers—why 140 years of memory science still can't make you fluent [Research report]. Yudame Research Podcast. https://yudame.ai/podcast/yudame-research/ep1-spaced-repetition/report/

Zou, F., Kuhl, B. A., DuBrow, S., & Hutchinson, J. B. (2025). Benefits of spaced learning are predicted by the re-encoding of past experience in ventromedial prefrontal cortex. Cell Reports, 44(2), 115232. https://doi.org/10.1016/j.celrep.2025.115232