Etienne’s Letter: From Pushkin to GPT
We talk about AI models as if they arrived from nowhere. They didn’t. They are the industrial-scale descendant of a 113-year-old technique for predicting what comes next.
It’s January 1913. A Russian mathematician named Andrey Markov is finishing up a fight with a colleague. The colleague has argued that probability theory applies only to events that are independent of one another, such as coin flips and dice rolls, and Markov is convinced this is rubbish. He picks an unlikely battlefield to prove it on: Alexander Pushkin’s Eugene Onegin, the verse novel every Russian schoolchild can recite by heart. He takes the first 20,000 letters of the poem, strips out the punctuation and the spaces, and by hand, with a pencil, classifies each letter as a vowel or a consonant and tallies which kind follows which. On January 23, he stands in front of the Imperial Academy of Sciences in St. Petersburg and presents his finding. A vowel in Pushkin’s poem is far more likely to be followed by a consonant than by another vowel. The letters are not independent. What comes next depends on what came before.
Markov had introduced the chains that now carry his name in 1906; this paper was the first time he applied them to real text. It is also widely regarded as the first probabilistic model of human language, and every count in it was made by hand.
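Markov’s tally is small enough to sketch in a few lines of Python. This is an illustration, not a reconstruction of his worksheets: the function name, the sample sentence, and the use of English vowels in place of Russian ones are all assumptions made here for the example.

```python
from collections import Counter

# Assumption: English vowels stand in for the Russian ones Markov worked with.
VOWELS = set("aeiou")


def transition_probabilities(text):
    """Tally which letter class (vowel or consonant) follows which."""
    # Keep letters only, which drops spaces and punctuation as Markov did.
    letters = [ch for ch in text.lower() if ch.isalpha()]
    classes = ["V" if ch in VOWELS else "C" for ch in letters]

    # Count every adjacent pair, e.g. ("V", "C") is a vowel followed by a consonant.
    pair_counts = Counter(zip(classes, classes[1:]))

    probs = {}
    for prev in "VC":
        total = pair_counts[(prev, "V")] + pair_counts[(prev, "C")]
        for nxt in "VC":
            # Share of the time `nxt` follows `prev`; 0.0 if `prev` never occurs.
            probs[(prev, nxt)] = pair_counts[(prev, nxt)] / total if total else 0.0
    return probs


# Hypothetical sample text; Markov used the first 20,000 letters of the poem.
sample = "What comes next depends on what came before."
probs = transition_probabilities(sample)
for (prev, nxt), p in sorted(probs.items()):
    print(f"P({nxt} follows {prev}) = {p:.2f}")
```

If the letters were independent, the chance of a vowel would be the same whatever came before it. The four numbers this prints are the whole model: two categories, one step of memory.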
A large language model, 113 years later, is the same idea at industrial scale. Given everything that has come before, what is the probability of each possible next token? Markov did it with pencil, paper, and two categories. A modern model does it with billions of parameters and a vocabulary of tens of thousands of tokens or more. The shape of the question hasn’t changed. The shape of the answer hasn’t either.
I find that lineage clarifying because it cuts the mysticism. We talk about these models as if they arrived from nowhere. They didn’t. They are the industrial-scale descendant of a 113-year-old technique for predicting what comes next. The interesting question is not whether the prediction is magic. It’s how the prediction is actually made, and that turns out to be something a CTO can hold in their head if it’s shrunk down far enough.
So I shrank it. Five words, five numbers per word, every table small enough to fit on one page. I want to walk you through it the way I built it, because I think any CTO making vendor and roadmap bets on this technology benefits from being able to reconstruct the mechanism from memory. Not the math of a production model. The shape of it.
The core idea: an LLM turns words into numbers, lets every word mix in a little of every other word according to how related they are, repeats that dozens of times, and uses the result to predict the next token. Everything else is scale.
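That sentence can be sketched as a toy program. What follows is a deliberately stripped-down illustration, not the one-page walkthrough this letter refers to and not how a production model is built: the five words, their numbers, and the function names are all made up for the example, and it leaves out the learned weight matrices, the feed-forward layers, and the rule that a word may only look at the words before it.

```python
import math

# Hypothetical toy data: five words, five made-up numbers per word.
words = ["the", "cat", "sat", "on", "mat"]
vectors = [
    [0.1, 0.0, 0.2, 0.1, 0.0],
    [0.9, 0.1, 0.0, 0.3, 0.2],
    [0.2, 0.8, 0.1, 0.0, 0.3],
    [0.0, 0.2, 0.1, 0.7, 0.1],
    [0.8, 0.0, 0.1, 0.4, 0.3],
]


def dot(a, b):
    """Similarity score: larger when two vectors point the same way."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_once(vecs):
    """One simplified attention pass: each word becomes a weighted blend of all words."""
    dim = len(vecs[0])
    mixed = []
    for query in vecs:
        # How related is this word to every word (itself included)?
        scores = [dot(query, key) / math.sqrt(dim) for key in vecs]
        weights = softmax(scores)
        # Blend all vectors, taking more from the more related words.
        mixed.append([sum(w * v[i] for w, v in zip(weights, vecs)) for i in range(dim)])
    return mixed


def next_word_probabilities(vecs, vocab_vectors):
    """Score the final position against each vocabulary vector, then normalize."""
    return softmax([dot(vecs[-1], v) for v in vocab_vectors])


state = vectors
for _ in range(3):  # real models stack dozens of layers; three keeps this readable
    state = mix_once(state)

probs = next_word_probabilities(state, vectors)
for word, p in zip(words, probs):
    print(f"{word}: {p:.3f}")
```

The numbers it prints mean nothing, because nothing here was trained. The point is the shape: vectors in, repeated mixing by relatedness, a probability for every word in the vocabulary out.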
Why the fog costs us
We aren’t training models from scratch. We don’t need to derive backpropagation. So why does a working CTO benefit from the internals?
Because we are committing budget, architecture, and hiring to a technology our leadership teams find mystifying, and mystification is a tax. It shows up when product asks for something the model structurally cannot do and we can’t articulate why. It shows up when a CEO reads one breathless essay and wants the roadmap rebuilt around it. It shows up when a senior engineer offers fine-tuning as the answer to every question and the room has nothing sharper than instinct to push back with.
When we can explain the machine plainly, the wrong things stop impressing us. Our questions about cost and latency get sharper, because we understand where they come from. And we become the person in the C-suite who can translate, which the past few years have made clear is most of the job.
Here’s the page.



