Feb 22, 2026
# CRITICAL EXAMINATION: AI Tutoring at Scale Brief
## Weakest Assumptions & Logical Leaps
### 1. **"14% improvement in math proficiency" (Newark/Khanmigo)**
**Demand for operational definition:** What exactly constitutes "math proficiency scores"? State standardized tests? Internal Khan Academy metrics? Classroom assessments? A 14% improvement on a low-stakes internal assessment is categorically different from a 14% improvement on state proficiency rates.
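**Missing baselines:** 14% improvement *from what baseline*, and in what units? If Newark started at 20% proficiency, a 14-percentage-point gain (to 34%) is meaningful, whereas a 14% relative gain (to roughly 23%) is modest. If they started at 60%, a 14-point gain would be extraordinary. Without the baseline and the unit, the number is decorative.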
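To make the ambiguity concrete, here is a minimal sketch of the two readings of "14% improvement" applied to the two illustrative baselines above. The function name `interpret_improvement` is hypothetical, and the baselines are the examples from this critique, not figures reported by the brief.

```python
def interpret_improvement(baseline_pct: float, improvement: float = 14.0) -> dict:
    """Return the post-intervention proficiency rate under two readings.

    'percentage_points' adds the improvement to the baseline rate.
    'relative' scales the baseline rate by (1 + improvement / 100).
    """
    return {
        "percentage_points": round(baseline_pct + improvement, 1),
        "relative": round(baseline_pct * (1 + improvement / 100), 1),
    }


# Illustrative baselines only; the brief does not report the real one.
for baseline in (20.0, 60.0):
    print(baseline, interpret_improvement(baseline))
```

The gap between the two readings widens or narrows with the baseline, which is why the brief needs to state both the starting rate and the unit.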
**Missing comparison:** Was there a control group? What was the counterfactual—students receiving no intervention, traditional tutoring, or business-as-usual instruction? **Label: UNVERIFIED without peer-reviewed publication or district-released methodology.**
### 2. **"2 million students reached" (Khanmigo)**
**Operational definition needed:** What does "reached" mean? Accounts created? Logged in once? Used for 10+ hours? Completed a learning module? EdTech is notorious for conflating "deployment" with "usage" with "learning."
**Missing unit:** Student-hours of actual engagement would be the credible metric. 2 million students × 5 minutes each is not a tutoring intervention.
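A quick back-of-the-envelope sketch shows how much the "2 million" figure depends on dosage. The per-student minutes below are hypothetical (the 5-minute and 10-hour cases echo the examples in this section), and `total_student_hours` is an illustrative helper, not a metric the brief reports.

```python
def total_student_hours(students: int, minutes_per_student: float) -> float:
    """Convert a headcount and average minutes of use into total student-hours."""
    return students * minutes_per_student / 60


STUDENTS_REACHED = 2_000_000  # headline figure from the brief

# Hypothetical dosage scenarios, in minutes per student.
scenarios = {"logged in once (5 min)": 5, "sustained use (10 hours)": 600}

for label, minutes in scenarios.items():
    hours = total_student_hours(STUDENTS_REACHED, minutes)
    print(f"{label}: {hours:,.0f} student-hours")
```

The same headcount spans two orders of magnitude in actual engagement, so "reached" alone says little about tutoring delivered.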
### 3. **"8,000+ U.S. schools"**
**Missing denominator and distribution:** 8,000 schools is roughly 6% of the ~130,000 K-12 schools in the U.S. Are these concentrated in wealthy suburban districts that could afford $44/student, or are they genuinely diverse? The "scale" claim requires a demographic breakdown.
### 4. **Mindspark RCT (J-PAL, 2017)**
This is the *only* credibly sourced claim in the brief. However:
- **Time window problem:** This is 7-year-old data on different technology (pre-LLM adaptive learning). Extrapolating Mindspark's 2017 results to justify 2024 LLM-based tutoring is a category error.
- **Context specificity:** 0.36 SD gains in *supplementary* computer lab time in Indian government schools may not transfer to U.S. classroom integration or home use.
### 5. **"$44/student/year" cost model**
**Missing comparison:** $44 vs. what alternative? Against human tutoring at $40-80/hour, this looks cheap. Against free Khan Academy videos plus teacher support, it does not. The value proposition requires cost per unit of outcome, not cost per seat.
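One way to express cost per unit of outcome is dollars per standard deviation of learning gain. The sketch below assumes that framing. Every effect size and the human-tutoring cost are placeholders chosen for illustration, since the brief supplies none of them; only the $44 seat price comes from the brief.

```python
def cost_per_sd(cost_per_student: float, effect_sd: float) -> float:
    """Dollars spent per student per 1.0 SD of learning gain.

    Raises ValueError when the effect is zero or negative, because
    cost-effectiveness is undefined without a positive outcome.
    """
    if effect_sd <= 0:
        raise ValueError("effect size must be positive")
    return cost_per_student / effect_sd


# Hypothetical inputs: (annual cost per student in USD, effect size in SD).
options = {
    "AI tutor seat": (44.0, 0.10),
    "Human tutoring": (2000.0, 0.20),
    "Free videos + teacher support": (0.0, 0.05),
}

for name, (cost, effect) in options.items():
    print(f"{name}: ${cost_per_sd(cost, effect):,.0f} per SD")
```

Under this framing a cheap seat with a small effect can still beat an expensive intervention, and a free alternative with any positive effect beats both, which is why the effect size has to be reported alongside the price.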
---
## Strongest Claim & Why It's Likely Overstated
**The Newark 14% improvement is the headline claim and the most suspect.**
- One semester is insufficient to establish durable learning effects (summer fade, novelty effects).
- "Pilot" studies systematically outperform at-scale deployment (Hawthorne effect, selection bias in participating teachers).
- There is no mention of implementation fidelity: were teachers trained? Was usage mandated or optional?
- **Counterexample:** The IES What Works Clearinghouse consistently shows EdTech pilots failing to replicate at scale. The 2023 RAND study on pandemic-era tutoring showed high-dosage human tutoring produced ~0.2 SD gains; claiming AI tutoring exceeds this without rigorous methodology is extraordinary.
---
## Two Missing Data Points
1. **Dosage data:** Average minutes/week of actual AI tutor interaction per student, with distribution (median, not just mean). Without this, we cannot distinguish "tutoring" from "occasional homework help."
2. **Differential effects by student subgroup:** Does AI tutoring help struggling students catch up, or does it primarily accelerate already-proficient students? The equity claim implicit in "scale" requires disaggregated data by prior achievement.
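The request for a median in the dosage item can be illustrated with a small sketch. The weekly-minutes values below are invented to show a skewed usage pattern and do not come from the brief or any deployment.

```python
import statistics

# Hypothetical minutes/week for ten students: most barely use the tool,
# while one heavy user dominates the total.
weekly_minutes = [0, 0, 0, 2, 3, 5, 5, 8, 10, 120]

mean_minutes = statistics.mean(weekly_minutes)
median_minutes = statistics.median(weekly_minutes)

print(f"mean:   {mean_minutes:.1f} min/week")
print(f"median: {median_minutes:.1f} min/week")
```

With usage this skewed, the mean suggests regular engagement while the median shows the typical student spending only a few minutes a week, which is the distinction between "tutoring" and "occasional homework help."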