1 Introduction

The pitch is seductive. An infinitely patient tutor, available at any hour, fluent in every mathematical topic from long division to differential equations — and it never loses its temper when a student asks the same question for the fifth time. Across the edtech industry, “AI”-powered math tools are positioned as a revolution: democratizing access, personalizing learning, and meeting students where they are. Yet the uncritical enthusiasm surrounding these tools deserves scrutiny — as Al-Zahrani (2024) warns, the hype around AI in education frequently obscures its fundamental limitations and risks.

But what if the tutor doesn’t understand mathematics? What if it has never understood mathematics, is structurally incapable of understanding mathematics, and — most troublingly — is designed to hide that fact behind a mask of unwavering confidence?

Recent research from Anthropic, MIT, and OpenAI itself has converged on an unexpectedly transparent conclusion: large language models do not reason mathematically. They approximate. They guess. And they fabricate explanations of how they arrived at their answers. Deploying such a tool in the service of mathematics education is not merely unhelpful — it is actively corrosive to the very thing education is supposed to develop.

2 What a Large Language Model Actually Is

Before examining why LLMs fail at mathematics, it is worth understanding what a large language model actually does — because the gap between public perception and mechanical reality is where much of the confusion begins.

A large language model — the technology behind ChatGPT, Claude, Gemini, and their competitors — is, at its core, a next-token predictor. Given a sequence of words (or more precisely, tokens — fragments of words, numbers, and punctuation), the model’s sole task is to predict what comes next. It does this over and over, one token at a time, until it has generated a complete response.

Figure 1: The basic loop of a large language model. The model receives a sequence of tokens and predicts the most probable next one. This prediction is appended, and the process repeats. Every response — whether a poem, a proof, or an apology — is generated through this single mechanism.

This is a crucial point to internalize: the model is not “thinking” about your question, retrieving an answer from a database, or applying rules of logic. It is producing the statistically most likely next word, given everything that precedes it. When it produces a correct mathematical answer, it has not computed anything — it has predicted a sequence of tokens that looks like a correct mathematical answer, based on patterns absorbed from its training data.
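The loop described above is easy to caricature. The toy model below — a simple bigram counter standing in for billions of learned weights, with an invented miniature corpus — does exactly what the paragraph describes: pick the statistically most likely next token, append it, repeat.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for "vast amounts of text".
corpus = "the cat sat on the mat . the cat ate . the cat ran .".split()

# "Training": count which token follows which. This bigram table is a
# crude stand-in for the billions of weights a real model adjusts.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(token):
    """Return the statistically most likely next token."""
    return following[token].most_common(1)[0][0]

def generate(prompt, n):
    """The basic LLM loop: predict, append, repeat."""
    tokens = prompt.split()
    for _ in range(n):
        tokens.append(predict_next(tokens[-1]))
    return " ".join(tokens)

print(predict_next("the"))   # "cat" — it follows "the" most often
print(generate("on", 2))     # "on the cat"
```

The toy model produces fluent-looking continuations for contexts it has seen, while having no notion of what any token means. The same loop, scaled up enormously, is the whole mechanism.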

The training process itself reinforces this. A language model learns by consuming vast amounts of text — books, websites, academic papers, forum posts, code repositories — and adjusting its internal parameters (billions of numerical weights) so that its predictions better match the patterns in that data. The model that emerges from this process is extraordinarily good at producing text that resembles human-written text. It has, in a meaningful sense, learned the shape of human language — the rhythms, the conventions, the statistical regularities.

What it has not learned is what any of it means.

Figure 2: The training process and its fundamental gap. The model learns to predict tokens by absorbing statistical patterns from text. It learns the form of mathematical reasoning — but not the substance.

An intuitive way to think about this: if you see the sentence “all elephants aren’t ___”, you can guess that the next word is probably “pink”, “the”, or “zebras”, depending on what the context suggests. You are predicting based on the pattern of the sentence. A language model does the same thing, but scaled to billions of parameters and trillions of training examples.

Figure 3: A fully connected feed-forward neural network for next-token prediction. Three input neurons receive token embeddings for the words “all”, “elephants”, and “aren’t”. These are propagated through hidden layers, each fully connected to the next — meaning every neuron in one layer is linked to every neuron in the subsequent layer via a learned weight. A single output neuron produces the predicted token “pink”.
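The architecture in Figure 3 can be sketched in a few lines. Everything below is illustrative — a four-word vocabulary and random, untrained weights, so the output is arbitrary; the point is the mechanism: embeddings in, fully connected layers, one score per candidate next token out.

```python
import math
import random

rng = random.Random(0)
vocab = ["pink", "the", "zebras", "grey"]

# Each token gets a small vector representation (dimension 4 here).
embed = {w: [rng.gauss(0, 1) for _ in range(4)]
         for w in ["all", "elephants", "aren't"]}

def make_layer(n_in, n_out):
    """Weight matrix: every input neuron linked to every output neuron."""
    return [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]

def forward(x, weights):
    """One fully connected layer with a tanh nonlinearity."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

# Concatenated embeddings of the context "all elephants aren't": 12 inputs.
x = embed["all"] + embed["elephants"] + embed["aren't"]
h = forward(forward(x, make_layer(12, 8)), make_layer(8, 8))  # hidden layers
logits = forward(h, make_layer(8, len(vocab)))                # one per word

# Softmax turns the scores into a probability for each candidate token.
exps = [math.exp(z) for z in logits]
probs = [e / sum(exps) for e in exps]
print(max(zip(probs, vocab)))  # the untrained network's arbitrary pick
```

Training would adjust the weights until the probability mass lands on “pink”; untrained, the pick is meaningless — which is itself a useful reminder that the mechanism is the same either way.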

Now consider a different prompt: 4847 × 391 =. You cannot intuitively predict the answer from the pattern of the text. You would need to compute it — applying the rules of multiplication step by step, or reaching for a calculator. The answer is not a matter of linguistic probability. It is a matter of mathematical procedure.
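The procedure is exactly what deterministic tools execute. A few lines of Python settle the product above — both directly and via the place-value algorithm taught in school:

```python
# Exact arithmetic is a procedure, not a prediction. Python's integers
# apply the rules of multiplication exactly, at any size.
a, b = 4847, 391
print(a * b)  # 1895177

# The same answer falls out of the place-value procedure a student
# learns: multiply by each digit of 391, shift by its place, then sum.
partials = [a * int(d) * 10 ** i for i, d in enumerate(reversed(str(b)))]
print(partials)       # [4847, 436230, 1454100]
print(sum(partials))  # 1895177
```

Each partial product corresponds to one row of the long-multiplication layout; the sum is the carrying step made explicit.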

This is the core tension. Language models are prediction machines operating in a domain — mathematics — where prediction is not the right tool. A student can intuitively guess the next word in a sentence, and so can an LLM. Neither can intuitively guess the product of two four-digit numbers. The difference is that the student knows to reach for a different method. The language model does not have a different method. It has only prediction — and when prediction is not enough, it guesses anyway.

3 The Illusion of Competence

In March 2025, Anthropic’s interpretability team published On the Biology of a Large Language Model (Lindsey et al., 2025), a landmark study that peered inside the internal mechanisms of Claude 3.5 Haiku to understand how it actually processes information. Among their case studies was a simple one: two-digit addition. The prompt was 36 + 59 =.

What they found was noteworthy. The model does not add. It does not perform anything resembling addition. It runs parallel heuristic pathways — one approximate, one modular — and combines them to converge on an answer. A low-precision pathway activates features like “add something near 36 to something near 60”, producing a “sum is near 92” signal. A separate high-precision pathway tracks only the final digits: “something ending in 6 plus something ending in 9” yields “sum ends in 5”. These two signals combine: near 92, ends in 5 — the answer must be 95.

Figure 4: How Claude 3.5 Haiku actually computes 36 + 59, as revealed by Anthropic’s attribution graph analysis. Two parallel heuristic pathways — one approximate, one tracking only final digits — converge on an answer. No carrying, no place-value reasoning, no arithmetic.

This is not arithmetic. It is targeted guessing, refined through overlapping heuristics until the confidence is high enough to commit to an output. There is no carrying, no place-value reasoning, no mathematical operation in any meaningful sense. The model has memorized a lookup table for single-digit additions and wrapped it in a series of approximate pattern-matching circuits.
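A crude caricature of the two pathways — illustrative only; the real mechanism consists of learned features in an attribution graph, not explicit code — shows how a rough magnitude estimate and a last-digit lookup can pin down an answer without ever performing the standard addition algorithm:

```python
a, b = 36, 59

# Low-precision pathway: round each operand, producing a coarse
# "the sum is somewhere around 100" signal (the real feature fires
# for "near 92"; this caricature is blunter).
estimate = round(a, -1) + round(b, -1)          # 40 + 60 = 100
candidates = range(estimate - 5, estimate + 5)  # any 10-wide window works

# High-precision pathway: a memorized table of final digits, nothing more.
ends_in = (a % 10 + b % 10) % 10                # 6 + 9 -> ends in 5

# Combination: the unique candidate with the right final digit.
answer = next(n for n in candidates if n % 10 == ends_in)
print(answer)  # 95
```

Note that any 10-wide window contains exactly one number with a given final digit, which is why the two weak signals suffice — and why neither pathway needs anything resembling carrying or place value.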

But the truly alarming finding came next. When the researchers asked the model to explain how it computed 36 + 59, it responded with a fluent, step-by-step account: “I added the ones (6 + 9 = 15), carried the 1, then added the tens (3 + 5 + 1 = 9), resulting in 95”. A perfectly human explanation — and a complete fabrication. The model’s actual internal process bore no resemblance to the explanation it offered. As the researchers noted, the mechanism by which the model learns to do something and the mechanism by which it learns to explain something are entirely separate processes.

Consider what this means in an educational context. A student asks an “AI tutor” to explain how to add two numbers. The LLM produces a clear, confident, step-by-step explanation that maps perfectly onto the standard algorithm taught in schools. The student nods along. But the explanation was not derived from any understanding of addition — it was pattern-matched from training data that contained similar explanations. The LLM does not understand why carrying works, what place value means, or what addition is. It simply retrieved a plausible-sounding sequence of words.

The student has received an answer. They have not received understanding.

4 Structural Limits, Not Growing Pains

A common defense is that these are early days — that models will improve, that the next generation will be smarter. The research suggests otherwise. The limitations are architectural, not developmental.

In May 2025, researchers at MIT published Superposition Yields Robust Neural Scaling (Liu et al., 2025), a paper that finally explained mathematically why bigger language models perform better — and, in doing so, revealed a ceiling. The key finding is that language models operate in what the researchers call strong superposition: they store far more concepts than they have dimensions to represent them. GPT-2, for example, crams a vocabulary of roughly 50,000 tokens into fewer than 2,000 embedding dimensions. Nothing is discarded. Everything overlaps.

Figure 5: Strong superposition in LLMs (Liu et al., 2025). All tokens are stored in a space with far fewer dimensions than there are tokens. The resulting interference decreases with model width (1/m), explaining why scaling works — and why it has a ceiling. Bigger models are not smarter; they simply have more room for the same compressed, overlapping information.

This means that every piece of information inside the model interferes with every other piece. The token “derivative” shares representational space with “integral”, “derivation”, and “derive” — not because the model understands their mathematical relationships, but because they must all coexist in the same compressed geometry. A number like 42 overlaps with 420, 4.2, and every other token that happens to occupy a nearby region of that space. The interference between these overlapping representations follows a precise mathematical law: it scales as 1/m, where m is the model’s width. Double the width, halve the interference. Double it again, halve it again.
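The 1/m behaviour is easy to observe directly: random directions in an m-dimensional space overlap with a mean squared cosine similarity of exactly 1/m. The sketch below — pure Python, with random vectors standing in for learned token representations — reproduces the trend: more width, less interference.

```python
import random

def mean_sq_overlap(m, n_vectors=100, seed=0):
    """Mean squared cosine similarity between random unit vectors in R^m.
    For random directions the expectation is exactly 1/m."""
    rng = random.Random(seed)

    def unit():
        v = [rng.gauss(0, 1) for _ in range(m)]
        norm = sum(x * x for x in v) ** 0.5
        return [x / norm for x in v]

    vecs = [unit() for _ in range(n_vectors)]
    sq = [sum(x * y for x, y in zip(vecs[i], vecs[j])) ** 2
          for i in range(n_vectors) for j in range(i + 1, n_vectors)]
    return sum(sq) / len(sq)

# Doubling the width roughly halves the measured interference.
for m in (50, 100, 200):
    print(m, round(mean_sq_overlap(m), 4))
```

The point of the demonstration: nothing about the stored vectors changes as m grows — they simply have more room, so they collide less. That is scaling in miniature.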

This is why scaling works — and why it has a limit. Bigger models are not smarter. They simply have more room, so the compressed information interferes less. The model is not developing deeper mathematical understanding as it scales. It is merely reducing the noise in its pattern-matching. The representations remain compressed, overlapping, and fundamentally approximate. As Liu, Liu, and Gore (2025) confirmed empirically across multiple model families, this is not a quirk of one architecture — it is a geometric property of how neural networks store information.

The physicist Sabine Hossenfelder (2025) frames this limitation sharply: these models interpolate, they do not extrapolate. They are capable of producing something similar to what already exists in their training data. They struggle with anything genuinely new. This is precisely the distinction that matters in mathematics — and one that Gary Marcus (2022) has argued is the central obstacle to artificial general intelligence. A student who can only reproduce familiar problem patterns has not learned mathematics — they have learned to mimic it. An LLM that can only interpolate between known examples is, in this sense, the perfect anti-tutor: a tool that models the very failure mode education should be working to overcome.

Moreover, the model lacks what researchers call grounded semantics — any connection between mathematical symbols and their meaning. Humans develop mathematical intuition through interaction with the physical world: counting objects, measuring distances, experiencing quantity. The number 5 is not merely a token to us — it is five fingers, five apples, five steps. For a language model, 5 is a point in a high-dimensional space that happens to be near other number-tokens. There is no concept of magnitude, no sense of quantity, no understanding that 9.9 is larger than 9.11 — a mistake ChatGPT was still making as recently as July 2024. (Only under a different convention, such as software version numbering, could “9.11” reasonably be read as coming after “9.9”; as a comparison of decimal numbers, the question has one answer.)
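The comparison is trivial for any tool with actual number semantics, and the version-string reading is a different operation entirely — a short sketch makes the distinction concrete:

```python
# As decimal numbers, 9.9 is unambiguously larger than 9.11.
print(9.9 > 9.11)  # True

# Only under a different convention -- version strings, whose
# dot-separated components are compared piecewise as integers --
# does "9.11" come after "9.9".
def version_key(v):
    return tuple(int(part) for part in v.split("."))

print(version_key("9.11") > version_key("9.9"))  # True: (9, 11) > (9, 9)
```

A tool with grounded number semantics never confuses the two operations; a model that represents both as nearby token patterns has no principled way to keep them apart.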

5 The Confidence Trap

Even setting aside the question of mathematical competence, there is the matter of honesty. In September 2025, OpenAI published Why Language Models Hallucinate (Kalai et al., 2025), a paper that proved mathematically that hallucinations — confident, incorrect outputs — are not a bug but an inevitable consequence of how these models are trained and evaluated.

The core finding is stark: even with perfect training data, the generative error rate of a language model is at least twice the error rate the same model would have on a simple yes/no question (Kalai et al., 2025). Errors accumulate across the sequential predictions that make up a response. And for facts that appear rarely in training data, hallucination is essentially guaranteed. The researchers demonstrated that if 20% of a particular type of fact appears only once in the training data, the model will get at least 20% of queries about those facts wrong.

But the deeper problem is structural. Kalai et al. (2025) examined ten major AI benchmarks — the tests that determine which models are considered “best” — and found that nine of them use binary grading: full marks for a correct answer, zero for anything else. Saying “I don’t know” scores the same as being completely wrong. Under this regime, the mathematically optimal strategy is always to guess. The entire ecosystem — training, evaluation, deployment — is designed to produce models that sound confident regardless of their actual certainty.

Figure 6: The binary grading trap. Under the scoring regime used by 9 out of 10 major AI benchmarks, expressing uncertainty (“I don’t know”) receives the same score as being wrong (0 points). The mathematically optimal strategy is always to guess — training models to sound confident regardless of actual certainty (Kalai et al., 2025).

OpenAI’s proposed solution is revealing: modify benchmarks to penalize wrong answers more than silence, effectively rewarding models for saying “I don’t know”. The mathematician Wei Xing (2025), writing in The Conversation, pointed out the commercial impossibility of this: if ChatGPT started responding “I don’t know” to even 30% of queries, users would abandon it overnight. The business model requires confidence. Uncertainty is not a feature the market will tolerate.
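The incentive structure reduces to a two-line expected-value calculation. The sketch below uses illustrative scores (not taken from any specific benchmark) to show why guessing dominates under binary grading and stops dominating once wrong answers cost more than silence:

```python
def binary_score(p):
    """Grading used by 9 of 10 benchmarks: correct = 1, wrong = 0,
    and "I don't know" also = 0. Returns (guess, abstain) expected scores
    for a model with confidence p that its answer is right."""
    return p * 1 + (1 - p) * 0, 0.0

def penalized_score(p, penalty=1.0):
    """OpenAI-style proposal: a wrong answer costs `penalty`, silence
    costs nothing. (The penalty value is illustrative.)"""
    return p * 1 + (1 - p) * -penalty, 0.0

# With only 10% confidence, guessing still beats abstaining under
# binary grading -- so the optimal policy is: always guess.
print(binary_score(0.1))     # (0.1, 0.0)

# Under a penalty, guessing at 10% confidence has negative expected
# value, so the optimal policy becomes "I don't know".
print(penalized_score(0.1))
```

In general, guessing beats abstaining under the penalized scheme only when confidence exceeds penalty / (1 + penalty) — which is exactly the uncertainty threshold binary grading erases.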

This creates a precise and dangerous dynamic for mathematics education. The student is interacting with a tool that is structurally incentivized to sound certain even when it is wrong. It will not flag its own uncertainty. It will not say “I’m not sure about this step”. It will present fabricated reasoning with the same fluency and conviction as correct reasoning. And the student — who is, by definition, not yet equipped to tell the difference — will have no way to distinguish between the two.

6 The Workaround Is the Tell

There is a revealing detail in how AI companies have addressed the math problem in practice. As several commentators have noted, when ChatGPT encounters a mathematical problem, it increasingly routes the computation to Python code — using an actual calculator rather than attempting the math itself. OpenAI is, in effect, acknowledging that its language model cannot do math by quietly handing the work to a tool that can.

This is sometimes presented as a strength: the model knows when to use a tool. But the failure mode is instructive — errors now occur primarily when the model fails to recognize that it should be using a calculator. In other words, the remaining mathematical judgment — knowing what kind of problem this is, knowing that it requires computation rather than pattern-matching — is precisely the kind of reasoning the model cannot reliably do.
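The routing pattern itself is easy to sketch. The toy router below is hypothetical — production systems use learned tool-calling, not an AST check — but it makes the division of labour concrete: prompts that parse as arithmetic go to exact evaluation, and everything else falls through to “prediction”, which is exactly where the errors live.

```python
import ast
import operator

# Hypothetical router for illustration only.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculate(node):
    """Exactly evaluate a pure-arithmetic expression tree."""
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](calculate(node.left), calculate(node.right))
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    raise ValueError("not plain arithmetic")

def answer(prompt):
    """Route to the calculator when the prompt parses as arithmetic;
    otherwise fall back to token prediction."""
    try:
        return calculate(ast.parse(prompt, mode="eval").body)
    except (SyntaxError, ValueError):
        return "<token prediction: fluent, possibly wrong>"

print(answer("4847 * 391"))         # 1895177 -- exact and deterministic
print(answer("explain the proof"))  # falls through to prediction
```

The hard part is not the calculator branch — it is the `try`/`except` decision itself, and in a real system that decision is made by the same pattern-matching machinery the workaround exists to route around.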

And here we arrive at a deeper question: if the best-case scenario is that the LLM acts as an interface to actual computational tools, why use the LLM at all? Calculators exist. Computer algebra systems exist. These tools are precise, transparent, and deterministic. They do not hallucinate. They do not fabricate explanations. They do not compress a rigorous mathematical concept into a plausible-sounding simplification.

7 The Trivialization Problem

There is a final argument sometimes offered in defense of LLMs in mathematics education: that even if the model cannot do math, it can serve as a kind of translator — reading a rigorous paper or textbook explanation and presenting it in more accessible language. This is the “best case” for the LLM as a mathematical intermediary.

But even this best case deserves scrutiny. What happens when a rich, carefully constructed mathematical explanation is passed through a language model? The model compresses it. It strips away the nuance that made the explanation precise. It replaces technical language with approximations. It may produce something that sounds clear — but clarity achieved through simplification is not the same as clarity achieved through understanding.

And then we must ask the harder question: even if the LLM’s simplified explanation is adequate — even if it is, somehow, perfectly faithful to the source material — what have we lost?

Mathematics education is not primarily about the transfer of correct answers from one entity to another. It is about the development of reasoning. The struggle to understand a difficult proof, the moment of confusion before clarity, the experience of working through notation that initially seems impenetrable — these are not obstacles to learning. They are the learning. They are the process by which a student develops the capacity for mathematical thought. Research on cognitive offloading suggests this concern is well-founded: Grinschgl, Papenmeier, and Meyerhoff (2021) found that while offloading cognitive tasks to external tools boosts immediate performance, it significantly diminishes memory and learning — precisely the trade-off at stake when students delegate mathematical reasoning to an LLM. Risko and Gilbert (2016) describe this more broadly as a fundamental tension in cognitive offloading: the convenience of external tools comes at the cost of internal cognitive development. More recently, Kosmyna et al. (2025) demonstrated that using ChatGPT for writing tasks leads to measurable accumulation of cognitive debt — reduced neural engagement and diminished learning outcomes — suggesting that the same dynamic extends to AI-assisted intellectual work more generally.

When we insert an LLM between the student and the source material — between the student and the struggle — we do not remove a barrier. We remove the exercise. It is as if we placed a student in a gym and then hired someone to lift the weights on their behalf. The task appears to have been completed. The muscle has not been built.

The mathematician’s craft is not merely knowing that a theorem is true. It is understanding why it is true, being able to reconstruct the reasoning, holding the logical structure in one’s mind. A student who receives a pre-digested explanation from an LLM has been given a fish. They have not been taught to fish. They have not even been shown what a fishing rod looks like — they have been shown a photograph of a fish and told that this is what fishing produces. As UNESCO (2023) cautions in its guidance on AI in education, the deployment of these tools must not come at the expense of the cognitive development and critical thinking they are ostensibly meant to support.

8 Conclusion

The case against LLMs in mathematics education does not rest on the claim that current models are imperfect — imperfections could be expected to shrink with time and scale. It rests on the convergent findings of the field’s leading researchers that the limitations are structural.

Large language models do not perform mathematical reasoning. They execute heuristic pattern-matching and present the results with fabricated explanations (Lindsey et al., 2025). Their architecture stores information through overlapping, compressed representations that improve with scale only by reducing interference — not by developing deeper understanding (Liu et al., 2025). They are trained and evaluated by systems that reward confident guessing over honest uncertainty, and the commercial incentives that sustain them make this effectively unfixable (Kalai et al., 2025; Xing, 2025).

A tool that cannot reason mathematically, cannot reliably know when it is wrong, and is incentivized to hide its uncertainty is not a suitable foundation for mathematics education. It is, in fact, a precise inversion of what mathematics education should be: a discipline built on rigor, transparency, logical structure, and the honest acknowledgment of what we do and do not know.

The productive struggle of learning mathematics cannot be outsourced. And the tool we are being asked to outsource it to does not understand what it is pretending to teach.

Al-Zahrani, A. M. (2024). Unveiling the shadows: Beyond the hype of AI in education. Heliyon, 10(9), e30696. https://doi.org/10.1016/j.heliyon.2024.e30696
Grinschgl, S., Papenmeier, F., & Meyerhoff, H. S. (2021). Consequences of cognitive offloading: Boosting performance but diminishing memory. Quarterly Journal of Experimental Psychology, 74(9), 1477–1496. https://doi.org/10.1177/17470218211008060
Hossenfelder, S. (2025, October 19). Current AI Models have 3 Unfixable Problems. Backreaction. https://backreaction.blogspot.com/2025/10/current-ai-models-have-3-unfixable.html
Kalai, A. T., Nachum, O., Vempala, S. S., & Zhang, Y. (2025). Why Language Models Hallucinate. arXiv preprint arXiv:2509.04664. https://arxiv.org/abs/2509.04664
Kosmyna, N., Hauptmann, E., Yuan, Y. T., Situ, J., Liao, X.-H., Beresnitzky, A. V., Braunstein, I., & Maes, P. (2025). Your Brain on ChatGPT: Accumulation of cognitive debt when using an AI assistant for essay writing task. arXiv preprint. https://arxiv.org/abs/2506.08872
Lindsey, J., Gurnee, W., Ameisen, E., Chen, B., Pearce, A., Turner, N. L., Citro, C., Abrahams, D., Carter, S., Hosmer, B., Marcus, J., Sklar, M., Templeton, A., Bricken, T., McDougall, C., Cunningham, H., Henighan, T., Jermyn, A., Jones, A., … Batson, J. (2025). On the Biology of a Large Language Model. Transformer Circuits. https://transformer-circuits.pub/2025/attribution-graphs/biology.html
Liu, Y., Liu, Z., & Gore, J. (2025). Superposition Yields Robust Neural Scaling. arXiv preprint arXiv:2505.10465. https://arxiv.org/abs/2505.10465
Marcus, G. (2022). Deep Learning Is Hitting a Wall. Nautilus. https://nautil.us/deep-learning-is-hitting-a-wall-238440/
Risko, E. F., & Gilbert, S. J. (2016). Cognitive offloading. Trends in Cognitive Sciences, 20(9), 676–688. https://doi.org/10.1016/j.tics.2016.07.002
UNESCO. (2023). Guidance for generative AI in education and research. UNESCO. https://unesdoc.unesco.org/ark:/48223/pf0000386693
Xing, W. (2025, September). Why OpenAI’s solution to AI hallucinations would kill ChatGPT tomorrow. The Conversation. https://theconversation.com/why-openais-solution-to-ai-hallucinations-would-kill-chatgpt-tomorrow-265107