What's Actually Happening When You Talk to ChatGPT
· by Michael Doornbos · 2639 words
Every few days someone asks me how large language models work. The answers they’ve gotten usually fall into two camps: “it’s just autocomplete” (dismissive to the point of uselessness) or a wall of linear algebra that helps nobody outside a machine learning lab.
I’ve been trying to explain how AI and LLMs work to ordinary people, and I’m not great at it yet. So I figured I’d write out a plain-language reference for myself first. Did I succeed at making this accessible to the everyday person in 2,500 words? Maybe not. But it’s a start. Summarizing it for myself as a technologist might help me figure out how to explain it to my Mom.
So here goes. No math, just the concepts, plus a few optional toy code sketches for anyone who wants to poke at the ideas.
It predicts the next word
When people say “AI” right now, they almost always mean a large language model, or LLM. These aren’t the same AI that researchers have been working on for decades. Traditional AI was about writing explicit rules: if this, then that. Chess engines, expert systems, decision trees. LLMs are different. Nobody writes rules. Instead, you feed the model an enormous amount of text and let it figure out the patterns on its own.
And the core of what it figures out is simple: given some text, predict what comes next. Not “understand.” Not “think about.” Predict.
If you type “the cat sat on the,” the model assigns probabilities to every possible next word it knows. “Mat” gets a high probability. “Quantum” gets a low one. The model picks from those probabilities and produces a word. Then it takes everything so far, including the word it just generated, and predicts the next one. And the next. And the next.
Every conversation you’ve had with ChatGPT, Claude, or any other LLM is this process running in a loop. The model generates one word (or piece of a word) at a time, each one informed by everything that came before it.
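If you like seeing things in code, here’s a toy sketch of that loop. The predict_next_token function is a fake stand-in for the model, and the probabilities are made up, but the shape of the loop is the real thing.

```python
import random

def predict_next_token(text):
    # Fake stand-in for the model. A real model would look at the whole
    # text so far and return a fresh set of probabilities over its
    # entire vocabulary at every single step.
    return {"mat": 0.82, "floor": 0.11, "roof": 0.05, "quantum": 0.02}

text = "the cat sat on the"
for _ in range(3):
    candidates = predict_next_token(text)
    # Pick the next word at random, weighted by its probability.
    next_word = random.choices(
        list(candidates.keys()), weights=list(candidates.values())
    )[0]
    text = text + " " + next_word  # the new word becomes part of the input
print(text)
```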
The catch is that the model has no way to distinguish good text from bad. It trained on all of it: the well-architected open source library and the copy-pasted Stack Overflow answer that never actually worked. The insightful blog post and the SEO spam. It learns what’s probable, not what’s right.
That’s the whole trick. The rest of this article is about how you get from “predict the next word” to something that feels like intelligence.
Text becomes numbers
Computers don’t understand words. They understand numbers. So the first step is converting text into something a computer can work with.
LLMs do this through tokenization. A tokenizer breaks text into chunks called tokens, which are usually common words or pieces of words. The word “unhappiness” might become two tokens: “un” and “happiness.” Common words like “the” are usually a single token. Rare or long words get split into smaller pieces.
Each token maps to a number. The model doesn’t see “the cat sat on the mat.” It sees something like [1996, 5782, 3352, 319, 1996, 13249]. Every LLM has a vocabulary of unique tokens, typically 30,000 to 100,000 entries. Think of it as the model’s dictionary.
That dictionary is small. But the amount of text you can write with it is unlimited, the same way English has a finite number of words but an infinite number of possible sentences. A frontier model (industry term for the biggest models from OpenAI, Anthropic, Google, etc.) might train on 10 trillion or more tokens of text, all built from that same compact vocabulary.
This is why LLMs are bad at counting letters. When you ask “how many r’s are in strawberry,” the model never sees individual letters. It sees tokens, and the boundaries between tokens don’t line up with the boundaries between letters. The word “strawberry” might be two tokens that split in an unintuitive place.
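If you want to poke at this yourself, OpenAI publishes its tokenizer as an open-source library called tiktoken (pip install tiktoken). The IDs it prints won’t match my illustrative numbers above, since every model family has its own dictionary, but you can see for yourself exactly where it splits “strawberry.”

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a tokenizer used by recent OpenAI models

ids = enc.encode("the cat sat on the mat")
print(ids)                             # a list of token numbers
print([enc.decode([i]) for i in ids])  # the chunk of text each number stands for

print([enc.decode([i]) for i in enc.encode("strawberry")])  # see where it splits
```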
How numbers carry meaning
A list of token numbers isn’t enough. The number 1996 for “the” doesn’t tell the model anything about what “the” means or how it relates to other words. We need something better.
This is where embeddings come in. But first, you need to know what a vector is. A vector is just a list of numbers. Your GPS location is a simple vector: two numbers (latitude and longitude) that pinpoint where you are on a surface. Add altitude and you have three numbers that locate you in three-dimensional space. A vector with a thousand numbers works the same way, just in a space with a thousand dimensions. Humans have a hard time seeing past three dimensions, but the math doesn’t care.
Each token gets its own vector, typically a thousand or more numbers long. This is called an embedding. Each of those numbers captures some aspect of what the token means.
During training, tokens that appear in similar contexts drift toward each other in this space. “Dog” and “cat” end up near each other because they show up in similar sentences. “King” and “queen” land nearby for the same reason.
This produces relationships that feel almost like real meaning. The classic example: take the embedding for “king,” subtract “man,” add “woman,” and you end up close to “queen.” The model has no concept of gender or royalty. It just learned patterns in how these words relate to each other, and those patterns happen to mirror the real world.
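Here’s a toy version with two-number vectors I invented for this example. Real embeddings have thousands of dimensions, and none of them is so neatly labeled, but the arithmetic works the same way.

```python
import numpy as np

# Toy two-number "embeddings", invented for illustration. Read the first
# number as roughly "royalty" and the second as roughly "gender".
words = {
    "king":  np.array([0.9,  0.9]),
    "queen": np.array([0.9, -0.9]),
    "man":   np.array([0.1,  0.9]),
    "woman": np.array([0.1, -0.9]),
}

def similarity(a, b):
    # Cosine similarity: 1.0 means pointing in exactly the same direction.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

result = words["king"] - words["man"] + words["woman"]
for word, vector in words.items():
    print(word, round(similarity(result, vector), 2))
# "queen" scores highest: the arithmetic lands right on top of it.
```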
From here on out, the model is doing math on these vectors, not on words.
How it understands context
This is the part that made modern LLMs possible.
Before 2017, language models processed words in order, one at a time, like reading left to right. This made them slow and forgetful. By the time they reached the end of a long paragraph, they’d mostly forgotten the beginning.
Then a team at Google published a paper called “Attention Is All You Need” that changed everything. They introduced the transformer (the T in GPT), and its big idea was the attention mechanism.
Attention lets the model look at all the words at once and figure out which ones matter to each other. When processing the word “it” in a sentence, the model can look back to the noun “it” refers to, even if that noun was fifty words ago.
Consider these two sentences:
- “The bank by the river was eroding.”
- “The bank approved my loan application.”
Same word, completely different meaning. An attention-based model handles this naturally. When processing “bank” in the first sentence, the model pays attention to “river” and “eroding,” which push it toward the waterway meaning. In the second sentence, “approved” and “loan” push it toward the financial meaning. The model doesn’t have two definitions stored somewhere. It figures out which one you mean from context, every time.
Transformer models stack many layers of attention on top of each other. Each layer takes the vectors from the previous layer and transforms them a little further, which is where the “transformer” gets its name.
Early layers handle basic grammar. Middle layers pick up on things like subject-verb agreement across long distances. Later layers capture tone, intent, and topic. By the final layer, what started as a rough sketch of a word’s meaning has been sharpened into something that captures what this word means here, in this sentence, surrounded by these other words. That’s what the model uses to predict the next token.
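For the curious, here’s a stripped-down numpy sketch of self-attention. Real models learn separate “query,” “key,” and “value” projections and run many attention heads in parallel, but the core move is this: score every token against every other token, then blend.

```python
import numpy as np

def self_attention(tokens):
    # Score every token against every other token (an n-by-n matrix),
    # squash the scores into weights that sum to 1, then blend.
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax
    return weights @ tokens  # each token becomes a weighted mix of all tokens

# Six tokens, each an 8-number vector (random, just for the demo).
tokens = np.random.default_rng(0).normal(size=(6, 8))
print(self_attention(tokens).shape)  # still (6, 8): same tokens, now context-aware
```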
This is also what makes LLMs expensive to run. Every token has to pay attention to every other token. Double the input length, and you quadruple the work. This is why context windows, the maximum amount of text a model can consider at once, have limits, and why running these models takes serious hardware.
So that’s the architecture: tokens, vectors, attention, layers. But how does a model get good at any of this?
Building an LLM: two phases of training
First, what does “training” actually mean? It’s not like teaching a person.
A model starts as a massive collection of numbers, billions of them, set to random values. These numbers are called parameters, and they control everything: the embedding vectors, the attention weights, the transformation at every layer. At the start, the model is useless. It produces gibberish because all those parameters are random noise.
Training is the process of adjusting those parameters so the model gets better at its task. Show it an example. It makes a prediction. Measure how wrong it was. Nudge the parameters slightly in a direction that would have been less wrong. Repeat. Billions of times. The model never gets told “this is how grammar works” or “Paris is the capital of France.” It just gets trillions of chances to be slightly less wrong at predicting the next token. Knowledge falls out of that process.
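Here’s that loop shrunk down to a single made-up parameter. The real thing adjusts billions of them at once, guided by calculus rather than my hand-written nudge, but the shape is the same: guess, measure the error, nudge, repeat.

```python
# One made-up parameter that we want to land on 0.7 (a stand-in for
# whatever value makes the model's predictions less wrong).
target = 0.7
parameter = 0.0      # starts out arbitrary, like the model's random noise
learning_rate = 0.1  # how big each nudge is

for step in range(100):
    prediction = parameter              # the model's "guess"
    error = prediction - target         # measure how wrong it was
    parameter -= learning_rate * error  # nudge in the less-wrong direction

print(round(parameter, 3))  # very close to 0.7 after enough nudges
```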
Phase one: pretraining
The model reads an enormous amount of text. Books, websites, code repositories, academic papers, forums, documentation. For a model like GPT-4 or Claude, this is trillions of tokens, a huge chunk of all the text ever published on the internet.
Here’s what’s wild: by just learning to predict the next token, the model picks up grammar, facts, reasoning patterns, coding conventions, and more. Nobody taught it any of these things. They emerged on their own, because understanding grammar and facts makes you better at predicting what comes next.
Phase two: alignment
A raw pretrained model is not very useful as an assistant. It’s been trained to predict text, so if you type a question, it might just generate more questions, because on the internet, questions are often followed by more questions.
The second phase goes by a few names: fine-tuning, RLHF (reinforcement learning from human feedback), alignment. Whatever you call it, the process is straightforward. Human reviewers rate the model’s outputs, and the model is trained to produce responses that humans prefer: helpful, honest, not harmful.
This is why different LLMs can feel so different to talk to. The base prediction engine might be similar, but the alignment training gives each one its personality. It’s why Claude sounds different from ChatGPT, and why some models hedge everything while others give you straight answers.
Why it costs hundreds of millions of dollars
Training a frontier LLM requires thousands of GPUs running continuously for weeks or months. GPUs are the specialized processors originally designed for rendering video game graphics, now the workhorses of AI. A single NVIDIA H100 costs around $30,000. A training cluster might use 10,000 to 30,000 of them. The electricity bill alone can reach tens of millions of dollars. Add cooling systems, high-speed networking, storage, and the engineering team to keep it all running.
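A quick back-of-envelope, using only the rough numbers above (and ignoring that GPUs are usually rented or amortized across many training runs, so this is a sense of scale, not a price tag):

```python
# Back-of-envelope, using only the rough numbers from this section.
gpus = 20_000             # somewhere in the 10,000-30,000 range
price_per_gpu = 30_000    # roughly $30,000 per H100
electricity = 30_000_000  # "tens of millions" for power

hardware = gpus * price_per_gpu
print(f"GPUs alone: ${hardware:,}")  # $600,000,000
print(f"Plus power: ${hardware + electricity:,}")
```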
And that’s just pretraining. Alignment requires its own compute plus the cost of human reviewers providing feedback on thousands of model outputs. Then there’s data curation: assembling, cleaning, and filtering trillions of tokens of training data. That’s not cheap either.
This is why only a handful of organizations can build frontier models from scratch. It’s not just a technical challenge. It’s an industrial one, and the question of who gets to build these things and why is worth thinking about.
Rules of thumb
So what does an LLM actually learn? It doesn’t learn rules. Nowhere in the model is there a line that says “sentences end with periods” or “Python functions start with def.” There’s no grammar textbook in there. No dictionary. No list of facts.
What it learns are heuristics. Christopher Mims’ How to AI has a great way of putting it: the best translation of “heuristic” is simply “rule of thumb.”
You already use rules of thumb every day, and you developed them the same way: through repetition. A kid touches a hot stove once and learns a rule of thumb about stoves. A doctor sees thousands of patients and develops an instinct that “this looks like pneumonia” before running any tests. An experienced driver starts braking before consciously registering that a light has turned yellow. A good cook knows the onions are done by the sound, not the timer.
Nobody sat these people down and gave them a rulebook. They built their intuitions through exposure, trial and error, and lots of repetition. The rules aren’t written down anywhere. They live in the patterns your brain has absorbed over time. Some are great. Some are wrong. That’s how rules of thumb work.
LLMs do the same thing, just at a massive scale. After seeing trillions of examples of text, the model develops its own rules of thumb:
- Questions are usually followed by answers, not more questions.
- Code after “def” in Python should look like a function definition.
- Formal language should be met with formal language.
- When someone says “explain like I’m five,” use short sentences and simple analogies.
- A sentence that starts with “However” is probably going to contradict what came before.
- If someone asks about a city, they probably want to know the country it’s in too.
- An email that starts with “Dear” is more formal than one that starts with “Hey.”
This is the key to understanding why LLMs are so capable and so confidently wrong in the same conversation. Their rules of thumb are usually good. Sometimes surprisingly good. But they’re still just rules of thumb, not knowledge, not understanding. Just patterns that usually work.
Why LLMs do what they do
Once you understand all this, a lot of LLM behavior stops being mysterious.
Why they’re great at code. GitHub alone has hundreds of millions of public repositories. LLMs have seen more code than any human ever will. Common patterns, standard library usage, typical project structures: these are exactly the kind of things next-token prediction nails. The problem has been solved a thousand times in the training data, and the model has seen all thousand solutions.
Why they hallucinate. The model generates whatever comes next with the highest probability. It has no way to check whether something is true, only whether it sounds right. A confident, detailed, completely wrong answer can be more probable than “I’m not sure.”
Why they struggle with math and logic. Math requires exact, step-by-step computation. Token prediction is approximate and statistical. An LLM can pattern-match its way through simple arithmetic because it’s seen plenty of examples, but genuinely novel math problems require actual computation, not prediction. This is why companies are building tools and “agents” around LLMs. The model recognizes it needs to do something precise and hands the work off to something that can: a calculator for math, a code interpreter for logic, a database query for facts, a web search for current information, an API call for live data. The model is good at knowing what needs to happen. It’s bad at doing the precise parts itself. So you let it orchestrate and let real tools execute.
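Here’s a cartoon of that orchestration pattern. The ask_model function is a made-up stand-in for a real LLM call, and real agent frameworks are far more elaborate, but the division of labor is the point.

```python
# Cartoon of "let the model orchestrate, let real tools execute".
def ask_model(question):
    # Made-up stand-in for an LLM call. Pretend the model recognized that
    # this question needs exact arithmetic and asked for a tool instead
    # of guessing.
    return {"tool": "calculator", "input": "1234 * 5678"}

def calculator(expression):
    a, b = expression.split(" * ")
    return int(a) * int(b)  # exact computation, no prediction involved

tools = {"calculator": calculator}

request = ask_model("What is 1234 times 5678?")
print(tools[request["tool"]](request["input"]))  # computed, not predicted
```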
Why they’re bad at spatial reasoning. The training data is text. Spatial relationships, visual layouts, physical intuition: none of that translates well into token sequences. The model has read millions of descriptions of rooms but has never been in one.
Why they sometimes seem to reason. The training data is full of humans working through problems step by step. When an LLM produces what looks like a chain of thought, it’s generating text that follows those patterns. Sometimes the reasoning is correct. Sometimes it just looks correct, because the model is predicting plausible-looking steps, not actually working anything out.
Prediction isn’t understanding
LLMs are the most capable text prediction engines ever built, and it turns out that really good text prediction can look a lot like understanding. I don’t think it is understanding. But it doesn’t have to be to get real work done.
They’re not magic. They’re not conscious. They’re very good at predicting the next word. Everything impressive about them, and everything broken, follows from that one fact.
Now I need to figure out how to explain all of this to my Mom.
Did this help you understand LLMs better? What would you explain differently?