1 Introduction

The public story right now goes something like this: AI has arrived, and it speaks in paragraphs.

In part, this is true, but it is deeply misleading.

It is true that large language models (LLMs) have made a class of systems widely available that can write, summarize, translate, and imitate many textual forms with striking fluency. It is misleading because the term “AI” is being used as if it refers to one thing — often a single product category — when in reality AI is a broader (and quite old) field with multiple traditions, tools, and failure modes.

The result is confusion at exactly the wrong level: people argue about “AI” while meaning LLMs, and then generalize the strengths (and the hype) of one method to domains where its weaknesses are structural.

LLMs are optimized to produce plausible text. Institutions keep treating that plausibility as if it were knowledge, judgment, or competence. And because the outputs are fluent, the category error becomes easy to miss.

2 The Category Error: Compressing a Field Into One Artifact

One reason this debate keeps looping is linguistic: we compress a large field into one artifact, and then we lose the ability to choose methods appropriately. “AI” becomes “the chatbot”. And once that compression is installed, everything looks like a nail.

A reasonable, yet simplified, hierarchy looks more like this:

Figure 1: Simplified, non-exhaustive Venn-style diagram showing AI, ML, NLP, language models, and LLMs as specific subsets.
  • AI (Artificial Intelligence): systems that perform tasks associated with intelligent behavior
    • ML (Machine Learning): learning patterns from data
      • DL (Deep Learning): ML using multi-layer neural networks
        • Representation learning: learning useful features automatically
    • NLP (Natural Language Processing): language tasks (includes rule-based, statistical, and deep approaches)
      • LMs (Language Models): models of token sequences
        • LLMs: scaled-up language models

That hierarchy matters because it places LLMs where they belong: not as “what AI is”, but as a particular method family inside a particular subfield.

In parallel, AI also contains major approaches that do not reduce to “a big neural net that writes text”:

  • Symbolic AI & RBES (Rules-Based Expert Systems): explicit rules, logic, knowledge representation, inference
  • Computer Vision (CV): perception from images/video and OCR (classical geometry + modern DL)
  • Speech processing: phonetics, signal processing, acoustic modeling, pronunciation lexicons
  • Planning and control: search, optimization, robotics, constraint solving
  • Knowledge bases / knowledge graphs: structured facts and relations used for reasoning and retrieval
  • Classical statistical learning: logistic regression, CRFs, HMMs, Bayesian models, calibration and uncertainty

So yes: an LLM is indeed AI. But “AI” is not an LLM. Treating them as synonyms collapses crucial distinctions that matter most in institutional use:

  • constraints vs. generation,
  • inference vs. imitation,
  • verification vs. continuation,
  • determinism vs. sampling.

And once those distinctions disappear, the wrong tool starts looking like the obvious tool — because it is the only tool left in the story, buoyed by the gravitas of “science-fiction AI”.

3 What an LLM Is (And What It Is Doing When It “Answers”)

An LLM is a probabilistic sequence model trained to predict the next token in a text. That sounds abstract, but the mechanism is simple:

  • Given the text so far,
  • predict the next token,
  • append it,
  • repeat.

A token is usually not a word but a word-piece: “educa”, “tion”, punctuation, whitespace, fragments, and so on. Your prompt is converted into tokens, tokens are mapped to vectors (numbers), and the model then produces a probability distribution over which token should come next.
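The whole generation loop can be sketched in a few lines of Python. Everything here is a stand-in: the toy “model” below is just a hard-coded lookup table, while a real LLM computes the distribution with billions of parameters — but the outer loop is genuinely this simple:

```python
import random

def generate(model, prompt_tokens, max_new_tokens=8, seed=0):
    """Autoregressive loop: ask the model for a next-token distribution,
    sample one token, append it, repeat."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        distribution = model(tokens)            # {token: probability}
        choices, weights = zip(*distribution.items())
        tokens.append(rng.choices(choices, weights=weights)[0])
    return tokens

def toy_model(tokens):
    """Stand-in 'model': a hard-coded next-token table keyed on the
    last token only. A real LLM conditions on the whole context."""
    table = {"the": {"cat": 0.7, "dog": 0.3},
             "cat": {"sat": 1.0},
             "dog": {"sat": 1.0},
             "sat": {".": 1.0},
             ".":   {"the": 1.0}}
    return table[tokens[-1]]

print(generate(toy_model, ["the"], max_new_tokens=3))
```

Note that nothing in this loop checks truth. The model’s only job is to emit a distribution; the loop’s only job is to keep sampling from it.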

Two things matter here:

  1. The model is not optimized for truth, only for likelihood under the training distribution.
  2. It is forced to produce some continuation even when the correct answer is “I don’t know”.

3.1 The Transformer: Attention, Not Verification

Most modern LLMs are transformer models. A simplified view:

  • Tokenization: text → tokens
  • Embedding: tokens → vectors in a high-dimensional space
  • Self-attention: each token can weight (attend to) other tokens in context
  • Feed-forward layers: non-linear transformations refine representations
  • Output head: converts the final representation into next-token probabilities

Self-attention is the key ingredient: it allows the model to use relevant context dynamically. But attention is not grounding. It is not a truth mechanism. It is a way to reuse patterns conditioned on context.
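Mechanically, attention is just a softmax-weighted average. The sketch below (pure Python, a single query vector, and none of the learned projection matrices a real transformer would have) shows that nothing in the computation checks whether the attended-to context is true — it only reweights it:

```python
import math

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector.
    Returns the weighted mix of value vectors plus the weights."""
    d = len(query)
    # similarity of the query to each key, scaled by sqrt(dimension)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    # softmax turns scores into weights that sum to 1
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # the output is just a weighted average of the value vectors
    mixed = [sum(w * v[i] for w, v in zip(weights, values))
             for i in range(len(values[0]))]
    return mixed, weights

mixed, weights = attention(query=[1.0, 0.0],
                           keys=[[1.0, 0.0], [0.0, 1.0]],
                           values=[[10.0, 0.0], [0.0, 10.0]])
print(weights)   # the first key matches the query, so it gets more weight
```

Whatever is in the context gets mixed into the output in proportion to how well it matches — whether it is correct or not.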

If the context contains an error, an unspoken assumption, or missing constraints, the model does not “notice” in the way a responsible reasoner must. It will still produce an answer — a shaped continuation — often with rhetorical confidence.

3.2 Training: Competence in Language, Not Epistemic Discipline

Most LLMs are created in two main stages:

  1. Pre-training: minimize next-token prediction error on massive corpora (the model learns general textual structure and broad “knowledge-shaped” regularities). Depending on the size of the training data, this can take weeks to months.
  2. Post-training / alignment: fine-tuning to behave like a helpful assistant (supervised instruction tuning, preference optimization like RLHF/RLAIF, safety tuning).

This second stage matters socially: it makes output feel conversational, polite, and decisive. But it does not give the model a grounded mechanism for truth, source discipline, or responsibility.

The model becomes better at sounding like someone who knows. And this is key — the model is not sentient even if it may sound like it — it is only a plausibility engine. Try using a small model with a complex context, and the illusion breaks:

Given the prompt “Explain what a box plot diagram is in a way that a toddler would understand it, while at the same time compare it to noxzbox”, the model deepseek-r1:1.5b (which has not been trained on what “noxzbox” is) produces this result:

Imagine you’re looking at an egg carton. It has different sections, like the top, middle, and bottom, each labeled with things you want inside, such as toys or chocolates. This is similar to how a box diagram works for a toddler. They see the big picture (the whole box) and then understand what goes in each part (like little boxes for colors or shapes).

Nozzi, known for his innovative graphic designs, used many complex boxes with detailed instructions. These might be like small blocks that a toddler can’t handle because they’re too intricate. So, while toddlers grasp the big picture with simple box diagrams, Nozzi’s method was more detailed and potentially overwhelming.

You get a confident idiot.

4 The Confident Illusion Is Not a Rare Failure — It Is the Baseline Leaking

“Hallucination” is a loaded term because it implies abnormality: a glitch, a deviation from an otherwise reliable system. In LLMs, fabricated claims are better understood as a structural consequence of what the system is built to do.

If the objective is “produce plausible continuation”, then under uncertainty the model will do what it is rewarded for: produce plausible continuation. In many workflows, “I don’t know” is treated as failure. Confidence sells. Uncertainty does not. So the system naturally drifts toward a dangerous property: it fills gaps with convincing narrative. So-called “hallucination” isn’t something that can be “trained away” — it is inherent in how LLMs function.

This matters because institutions run on gaps: incomplete data, contested interpretation, time pressure, political constraints. Put a plausibility engine into that environment and you don’t get neutral assistance. You get a narrative generator that can be mistaken for analysis.

5 Generalized Systems on Specific Tasks: The Wrong Tool Can Be “Impressive” and Still Wrong

One reason LLMs spread is that they are general. They can do “a bit of everything”. That makes them tempting as a default interface: if you can pipe all tasks through one system, you reduce integration cost and decision-making friction.

But generality has a price:

  • Weak constraints: many tasks need deterministic behavior, not rhetorical “competence”.
  • Opaque failure modes: you yourself often cannot explain why the output is what it is.
  • High marginal compute: you pay a lot in inference for tasks that have cheap, exact solutions.
  • Epistemic slippage: people start asking the system for judgment because it can write like it has judgment.

So the argument is not necessarily “never use LLMs”. It’s: think twice before using a generalized model for a specific, well-defined problem. If a task has strong structure, strong constraints, and known metrics, there is usually a better tool.

6 The Boring Alternative: Rules-Based Algorithms That Are Testable, Cheap, and Accountable

The most boring alternative is also the most radical: don’t use a tool at all. Stop. Think. Do the cognitive work yourself. Not because tools are evil, but because the thinking step is often the entire point of the task. In a lot of cases the “AI solution” is just a way of avoiding the part of the task that is the task.

The obvious aside: we also keep routing plain retrieval, matching, and normalization problems through LLM chat interfaces because it feels modern. But if the task is “match this string to the most likely entity”, or “normalize names”, or “find likely duplicates”, you do not need a generative model. You need a pipeline that is:

  • deterministic (or at least bounded)
  • auditable
  • measurable
  • cheap
  • explainable

The classic approach is not glamorous, but it works because it is constrained. Here is a concrete example: approximate string matching (names, places, product titles, misspellings). A robust pipeline looks like this:

  1. Phonetic Encoding: Compute a phonetic key for each string and store it in an index. Examples:

    • Soundex (simple, old, coarse; still useful in some contexts)
    • Metaphone / Double Metaphone (better for English phonetics)
    • Cologne phonetics (German-oriented)
    • Language-specific encoders

    Why it’s better: phonetic keys collapse spelling variation into a stable representation. They’re cheap to compute and easy to index.

  2. Index-Based Candidate Retrieval: Use the phonetic key as a database index key to retrieve plausible candidates efficiently.

    Why it’s better: you don’t scan the world. You do targeted retrieval with predictable performance. The system’s behavior is inspectable (“why did we consider these candidates?”).

  3. Edit Distance Refinement: Refine candidate matches using an edit-distance metric, such as:

    • Damerau–Levenshtein distance (counts insertions, deletions, substitutions, and transpositions — useful for real typos)

    Why it’s better: the distance gives you a measurable similarity score. You can set thresholds, measure false positives, and tune.

  4. Ranking Under Explicit Criteria: Rank candidates using:

    • frequency statistics (“most common entity wins”)
    • domain lexicons (known place names, customer lists, product catalog)
    • optional context models (non-generative, or even small local models) if you have structured context

    Why it’s better: ranking is explicit and can be justified. You can say, “we chose X because it was within distance 2 and had highest prior frequency in this domain”.

This is what “good AI” often looks like: not a single big model, but a constrained pipeline with tight failure modes and clear accountability. In other words: less lazy “magic”, and more engineering — but it works every time.
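The four steps can be sketched end-to-end in a short, dependency-free script. The particular choices here are illustrative, not the only correct ones: a simplified American Soundex as the phonetic key, the optimal-string-alignment variant of Damerau–Levenshtein for refinement, and a made-up lexicon with made-up prior frequencies:

```python
from collections import defaultdict

def soundex(word: str) -> str:
    """Step 1: simplified American Soundex — first letter plus up to three digits."""
    mapping = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            mapping[ch] = digit
    word = word.lower()
    prev = mapping.get(word[0], "")
    digits = []
    for ch in word[1:]:
        code = mapping.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "hw":            # h and w do not reset the previous code
            prev = code
    return (word[0].upper() + "".join(digits) + "000")[:4]

def osa_distance(a: str, b: str) -> int:
    """Step 3: Damerau-Levenshtein, optimal-string-alignment variant
    (insertions, deletions, substitutions, adjacent transpositions)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def build_index(lexicon):
    """Step 2 (indexing side): phonetic key -> candidate entries."""
    index = defaultdict(list)
    for name, frequency in lexicon.items():
        index[soundex(name)].append((name, frequency))
    return index

def match(query, index, max_distance=2):
    """Steps 2-4: retrieve candidates by phonetic key, refine by edit
    distance, rank by (distance, prior frequency)."""
    candidates = index.get(soundex(query), [])
    scored = [(osa_distance(query.lower(), name.lower()), -freq, name)
              for name, freq in candidates]
    return [name for dist, _, name in sorted(scored) if dist <= max_distance]

# Hypothetical lexicon with prior frequencies, purely for illustration
lexicon = {"Robert": 120, "Rupert": 30, "Roberta": 15}
index = build_index(lexicon)
print(match("Robrt", index))   # a typo of "Robert"
```

Every stage is inspectable: you can print the phonetic bucket, the candidate list, the distances, and the final ranking, and each answer comes with a justification (“within distance 2, highest prior frequency”).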

7 Resource Regimes: The Hidden Cost of Making “Ask the Model” the Default

Even if you ignore epistemology, there is a material argument against casual, ubiquitous LLM use. LLMs are not just software. They are an industrial stack:

  • specialized chips
  • data centers
  • cooling systems
  • power provisioning
  • complex global supply chains
  • and an adoption logic that pushes toward ever more inference

A single query may feel trivial, but normalization matters. When you make a system the default interface to everyday tasks, you create a baseline demand that scales with every workflow, every employee, every student, every public service. This is where “but it’s only a little bit of compute” becomes a political argument: small per-use costs multiplied by institutionalization become infrastructure.

So the question is not “does this query use much power?” but “what kind of society do we build if we make generalized inference an ambient layer of everything?”. That question becomes sharper when the tasks could be solved by cheap deterministic methods.

Using an LLM to do phonetic matching is not just overkill. It is resource-heavy substitution for something we can do precisely with modest compute, locally, under our control. And the same argument goes for pretty much everything LLMs are used for today.

8 The Governance Problem: Reliance on External Actors Is Not a Footnote

Here is the part that should make anyone uneasy, especially in public sector or education: When you rely on a commercial LLM, you rely on an external actor for a core cognitive layer. That has consequences (read more about a notoriously untrustworthy “external actor” here: Google Privacy Violations and False Claims):

  • your toolchain can change without consent (model behavior shifts, hidden updates)
  • pricing changes become operational risk
  • data handling becomes a trust problem
  • audit becomes “trust the vendor” dressed up as compliance
  • your institution inherits the vendor’s incentives

And perhaps most importantly: responsibility flows downward. When the model’s output causes harm, the vendor is abstract, the model is opaque, and the individual user is blamed for “using it wrong”. This is not a sustainable responsibility chain.

8.1 The Best-Case Boundary: Local Model, Local System, Local Control

If an institution insists on using LLMs, the best case is not “pick the nicest chatbot”. The best case is:

  • a free and open model (weights and license that allow independent use and inspection)
  • running in a system you control (hardware you own, operating system you own)
  • with explicit constraints (narrow task scope, retrieval, logging, evaluation, strict separation from decision authority)

This doesn’t magically solve epistemic problems — the model is still a plausibility engine — but it reduces two major risks:

  1. dependency on external actors as cognitive infrastructure
  2. uncontrolled behavioral drift through vendor updates

It also forces an honest accounting of compute. If you have to pay for the hardware and electricity yourself, you stop pretending that “LLM everywhere” is free.

9 Refuse the Category Error — And Refuse Outsourced Judgment

The reason LLMs are persuasive is not just that they work. It is that they speak like someone who works. Language is our primary marker for mind. When a system produces fluent text, we start treating it as if it participates in knowledge and judgment. That reflex is understandable — and dangerous.

So the stance I’m arguing for is not anti-technology. It is anti-confusion.

  • Don’t treat plausibility as truth.
  • Don’t treat fluent output as accountable judgment.
  • Don’t treat a generalized generative system as the default tool for specific, structured tasks.
  • And don’t build institutions that rely on external actors for the cognitive layer of society.

If you must use LLMs, the least bad path is: use models you control, in systems you control, under constraints you can test — and reserve them for places where their strengths are real (drafting, ideation, and interface work), not where their weakness is fatal (verification and judgment).

Otherwise we are not adopting “AI”. We are adopting the aesthetic of intelligence — outsourcing reasoning to a system optimized for plausible continuation, and then acting surprised when it produces plausible mistakes at scale.