A Brief History of Text Generation
From Shannon's hand-picked letters to modern LLMs. The real outputs from ELIZA, RACTER, char-rnn, and GPT, and why each generation felt like a breakthrough.
1948: Claude Shannon's Manual Markov Chains
Before computers generated text, Shannon did it by hand. In "A Mathematical Theory of Communication", he demonstrated how statistical structure emerges from letter and word frequencies.
Method: Open a book to a random page. Pick a letter. Open to another page, scan until you find that letter, record the next letter. Repeat.
First-order approximation (letters chosen independently with English frequencies):
OCRO HLI RGWR NMIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH BRL
Second-order approximation (digram frequencies):
ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE
Word-level bigram approximation:
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.
Shannon's point: even without understanding, statistical structure produces something that looks like language. The higher the order, the more plausible the output.
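Shannon's book-flipping procedure is exactly sampling from empirical bigram frequencies. A minimal word-level sketch (stdlib Python; the tiny corpus below is an illustration, not Shannon's source text):

```python
import random
from collections import defaultdict

def build_bigrams(text):
    """Map each word to the list of words that follow it in the corpus."""
    followers = defaultdict(list)
    words = text.split()
    for a, b in zip(words, words[1:]):
        followers[a].append(b)
    return followers

def generate(followers, start, n_words=10, seed=0):
    """Walk the chain: each next word is drawn from the empirical
    distribution of words that followed the current one."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(n_words - 1):
        nxt = followers.get(out[-1])
        if not nxt:          # dead end: this word was never followed by anything
            break
        out.append(rng.choice(nxt))
    return " ".join(out)

corpus = ("the head and in frontal attack on an english writer "
          "that the character of this point is therefore another method")
print(generate(build_bigrams(corpus), "the"))
```

Because repeated choices in the follower lists occur in proportion to their corpus frequency, drawing uniformly from a list is the same as sampling the conditional bigram distribution.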
Try it yourself:
Watch Order 0, Order 2, and word bigrams generate simultaneously.
Source:
- Shannon, Claude E. "A Mathematical Theory of Communication." Bell System Technical Journal, vol. 27, 1948, pp. 379-423. people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf.
1966: ELIZA's Rogerian Reflections
ELIZA wasn't generative in the modern sense. It matched patterns and reflected them back. But it created the illusion of understanding, which is why Weizenbaum's 1966 paper matters.
The famous transcript:
User: Men are all alike.
ELIZA: IN WHAT WAY
User: They're always bugging us about something or other.
ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE
User: Well, my boyfriend made me come here.
ELIZA: YOUR BOYFRIEND MADE YOU COME HERE
User: He says I'm depressed much of the time.
ELIZA: I AM SORRY TO HEAR YOU ARE DEPRESSED
User: It's true. I am unhappy.
ELIZA: DO YOU THINK COMING HERE WILL HELP YOU NOT TO BE UNHAPPY
User: I need some help, that much seems certain.
ELIZA: WHAT WOULD IT MEAN TO YOU IF YOU GOT SOME HELP
User: Perhaps I could learn to get along with my mother.
ELIZA: TELL ME MORE ABOUT YOUR FAMILY
ELIZA had no model of language. It scanned for keywords ("mother" triggered family questions, "depressed" triggered sympathy). Yet users formed emotional attachments. Weizenbaum was disturbed by this, spending the rest of his career warning about anthropomorphizing machines.
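The keyword mechanism can be sketched in a few lines. This is a toy reconstruction in the spirit of the transcript above, not Weizenbaum's actual DOCTOR script (which used ranked keywords and decomposition rules); the rule table is invented:

```python
# Toy keyword table (invented for illustration, not the original rules).
# Each entry: keyword to scan for, canned response template.
RULES = [
    ("mother", "TELL ME MORE ABOUT YOUR FAMILY"),
    ("depressed", "I AM SORRY TO HEAR YOU ARE DEPRESSED"),
    ("i need", "WHAT WOULD IT MEAN TO YOU IF YOU GOT {rest}"),
]

def respond(line):
    """Scan for the first matching keyword and echo the trailing words
    back inside a canned template. No model of language anywhere."""
    lowered = line.lower()
    for keyword, template in RULES:
        idx = lowered.find(keyword)
        if idx != -1:
            rest = line[idx + len(keyword):].strip(" .!?")
            return template.format(rest=rest.upper())
    return "PLEASE GO ON"    # content-free fallback when nothing matches

print(respond("I need some help."))
print(respond("Tell me about my mother."))
```

The reflection trick is the entire illusion: the program's most convincing responses are the user's own words handed back.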
Weizenbaum had stumbled across the computerised version of transference, with people attributing understanding, empathy and other human characteristics to software. As computers have become more capable, the Eliza effect has only grown stronger.
Take the way many people relate to ChatGPT. What distinguishes ChatGPT is not only the complexity of the large language model that underlies it, but its eerily conversational voice. As Colin Fraser, a data scientist at Meta, has put it, the application is "designed to trick you, to make you think you're talking to someone who's not actually there".
Try it yourself:
Experience the illusion of understanding firsthand.
Source: Weizenbaum, Joseph. "ELIZA - A Computer Program for the Study of Natural Language Communication Between Man and Machine." Communications of the ACM, vol. 9, no. 1, Jan. 1966, pp. 36-45.
1972: Dissociated Press
Described in HAKMEM and implemented in Emacs, Dissociated Press was a Markov chain text scrambler. The name parodies "Associated Press."
Applied to Alice in Wonderland (character-level):
Comed to usurparty sat sill Please, remarking, I real to come be patted on this, there was grow me to introduch nonsense! See: the next momently unting late. So she went on, You are creatures wouldn't seem to The Cat's head she sits purring.
Applied to the Jargon File (word-level):
wart: n. A small, crocky feature that sticks out of an array (C has no checks for this). This is relatively benign and easy to spot if the phrase is bent so as to be not worth paying attention to the medium in question.
The algorithm: print n characters from source, find another occurrence of the last few characters, jump there, continue. If it printed "president" and found "ent" in "pentagon," you'd get "presidentagon."
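That jump-and-continue loop can be sketched directly. This is a toy reconstruction from the description above, not the Emacs implementation; `overlap` is how many trailing characters must match before each jump, and the chunk lengths are arbitrary:

```python
import random

def dissociate(source, overlap=3, length=120, seed=0):
    """Dissociated Press, per the description above: copy a run of text,
    then jump to another occurrence of the last `overlap` characters
    and keep copying from there."""
    rng = random.Random(seed)
    pos = rng.randrange(len(source) - overlap)
    out = source[pos:pos + overlap]
    while len(out) < length:
        tail = out[-overlap:]
        # every place the trailing characters occur in the source
        hits = [i for i in range(len(source) - overlap)
                if source[i:i + overlap] == tail]
        if not hits:
            break
        jump = rng.choice(hits) + overlap
        chunk = source[jump:jump + rng.randrange(5, 15)]
        if not chunk:        # jumped to the very end of the source
            break
        out += chunk
    return out

text = "the president went to the pentagon to present the presents"
print(dissociate(text))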
Try it yourself:
Watch the algorithm work in real-time. See the cursor jump when it finds matching character sequences, and observe how local coherence emerges from simple pattern matching.
Source: "Dissociated Press." GNU Emacs Manual, Free Software Foundation, ftp.gnu.org/old-gnu/Manuals/emacs-20.7/html_node/emacs_427.html.
1984: RACTER's Poetry
RACTER (short for "raconteur") claimed to be the first AI author. Its book, The Policeman's Beard is Half Constructed, was marketed as "the first book ever written by a computer."
Sample output:
More than iron, more than lead, more than gold I need electricity. I need it more than I need lamb or pork or lettuce or cucumber. I need it for my dreams.
From a "conversation" in the book:
Joan, in view of the disgusting fact that you're an artist, listen to this.
The program maintained variables that would recur, creating threads of apparent coherence. But the commercial release was far simpler than whatever generated the book, and Chamberlain (the creator) likely edited heavily. The truth of RACTER's authorship remains disputed.
In fairness, human poets also do this.
Source: Chamberlain, William, and Thomas Etter. The Policeman's Beard Is Half Constructed: Computer Prose and Poetry. Warner Books, 1984. Internet Archive, archive.org/details/policemansbeardi0000unse.
1984: The Travesty Generator
Hugh Kenner and Joseph O'Rourke published "A Travesty Generator for Micros" in BYTE Magazine. The algorithm guaranteed that every n-character sequence in the output appeared somewhere in the input.
They were proudest when it produced "almost-passable faux-James Joyce" or "humorous but recognizable names of English towns." By mixing James Joyce with Henry James, one might find "travesties of haunting plausibility."
Technical detail: A kth-order travesty keeps a left context of k symbols. It can never generate any local combination it hasn't seen. It cannot generalize.
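A kth-order travesty is an order-k character chain over the source text. A minimal sketch (plain Python; the guarantee that every (k+1)-gram of the output occurs in the input falls out of how the table is built):

```python
import random
from collections import defaultdict

def travesty(source, k=3, length=80, seed=0):
    """Order-k travesty: the next character is drawn from the characters
    that followed the current k-character context in the source, so every
    (k+1)-gram of the output occurs somewhere in the input."""
    table = defaultdict(list)
    for i in range(len(source) - k):
        table[source[i:i + k]].append(source[i + k])
    rng = random.Random(seed)
    out = source[:k]                   # seed with the opening context
    while len(out) < length:
        choices = table.get(out[-k:])
        if not choices:                # context only appears at the very end
            break
        out += rng.choice(choices)
    return out

print(travesty("the cat sat on the mat and the cat ate the rat", k=3))
```

This makes the "cannot generalize" point concrete: the table has entries only for contexts actually seen, so no unseen local combination can ever be emitted, and raising k shrinks the choice lists toward a near-copy of the source.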
Try it yourself:
Watch the sliding context window sample one character at a time. Adjust k from 2 (chaotic) to 6 (near-copy) to see the coherence/originality tradeoff.
Source: Kenner, Hugh, and Joseph O'Rourke. "A Travesty Generator for Micros." BYTE, vol. 9, no. 12, Nov. 1984, pp. 129-131. Internet Archive, archive.org/stream/byte-magazine-1984-11/1984_11_BYTE_09-12_New_Chips#page/n129/mode/2up.
2011: Sutskever's RNN Text Generation
Sutskever, Martens, and Hinton trained RNNs on text using Hessian-free optimization. This was one of the first demonstrations that neural networks could learn language structure.
Trained on Hemingway's "The Old Man and the Sea":
"He he the fish," he was. He rose with the fish and to see the skiff and the bait.
Key capability: The RNN could balance parentheses and quotes over tens of characters. Markov chains are fundamentally incapable of this because they rely only on exact context matches. The RNN learned abstract structure.
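The parenthesis point can be made concrete: balance checking needs only one integer of running state, but that state must persist across arbitrary distances, which no fixed-context lookup table can provide. A toy illustration:

```python
def balanced(text):
    """One integer of 'hidden state' suffices to track nesting depth
    across arbitrary distances."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # closed before opening
                return False
    return depth == 0

# Both strings end with the same local context "d e)", so any k-gram
# model with a short window treats them identically -- yet only one
# is balanced, because the '(' sits outside the window.
assert balanced("(a b c d e)")
assert not balanced("a b c d e)")
```

An RNN's hidden state plays the role of `depth` here: a quantity carried forward indefinitely rather than reconstructed from the last few characters.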
Explore further:
Compare RNN and Markov chain architectures side-by-side, watch the training loop in action, and see why hidden state enables patterns that fixed-window models cannot learn.
Source: Sutskever, Ilya, et al. "Generating Text with Recurrent Neural Networks." Proceedings of the 28th International Conference on Machine Learning (ICML), 2011, pp. 1017-1024. icml.cc/Conferences/2011/papers/524_icmlpaper.pdf.
2013: Alex Graves's Sequence Generation
Graves's paper showed LSTMs generating handwriting and text. The network learned strokes, letters, short words, and even a rudimentary character-level language model.
Generated words that look plausible but don't exist:
eald, bryoes, lenrest
Handwriting synthesis: Given text input, the network produced convincing cursive in various styles. Samples were often indistinguishable from human writing.
Explore further:
Watch handwriting emerge stroke-by-stroke, explore how mixture density networks output probability distributions instead of single points, and see the plausible-looking non-words (eald, bryoes, lenrest) the network invented.
Source: Graves, Alex. "Generating Sequences With Recurrent Neural Networks." arXiv preprint arXiv:1308.0850, 2013. arxiv.org/abs/1308.0850.
2015: Karpathy's char-rnn
The Unreasonable Effectiveness of Recurrent Neural Networks went viral. Karpathy trained character-level LSTMs on various corpora and showed surprisingly coherent output.
Shakespeare:
PANDARUS: Alas, I think he shall be come approached and the day When little srain would be attain'd...
VIOLA: Why, Salisbury must find his flesh and thought That which I am not aps...
Paul Graham essays (temperature 0.5, showing repetition at low temperature):
is that they were all the same thing that was a startup is that they were all the same thing that was a startup is that they were...
LaTeX (algebraic geometry):
Nearly compilable proofs with correct theorem/lemma structure, but mismatched \begin{proof} with \end{lemma}.
Linux Kernel C:
Syntactically plausible code with variable naming errors and forgotten declarations.
Baby names (90% novel):
Rudi Levette Berice Lussa Hany Mareanne Chrestina Carissy Marylen...
The pattern: Coherent for approximately 100 characters, then degradation. The network learned structure (XML tags, LaTeX commands, C syntax) without understanding meaning.
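Karpathy's temperature knob rescales the output distribution before sampling, which is why the temperature-0.5 Paul Graham sample loops. A minimal softmax-with-temperature sketch (the logits are invented for illustration):

```python
import math
import random

def sample_with_temperature(logits, temperature, seed=0):
    """Divide logits by T, softmax, then sample. Low T concentrates mass
    on the top choice (repetitive); high T flattens toward uniform (noisy)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.Random(seed).choices(range(len(logits)), weights=probs)[0]

logits = [2.0, 1.0, 0.1]      # hypothetical scores for three tokens
cold = [sample_with_temperature(logits, 0.1, seed=s) for s in range(20)]
hot = [sample_with_temperature(logits, 10.0, seed=s) for s in range(20)]
print(cold.count(0), hot.count(0))  # cold runs pick token 0 almost every time
```

At temperature 0.5 the model keeps choosing its single most likely continuation, and for a model trained on startup essays, the most likely continuation of startup boilerplate is more startup boilerplate.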
Source: Karpathy, Andrej. "The Unreasonable Effectiveness of Recurrent Neural Networks." Andrej Karpathy Blog, 21 May 2015, karpathy.github.io/2015/05/21/rnn-effectiveness/.
2018: GPT-1
GPT-1 was among the first decoder-only Transformer language models. Generation was a byproduct of pretraining, not the focus: 117M parameters, trained on BookCorpus (~1B tokens).
Prompt: "Hello, I'm a language model,"
Outputs:
"Hello, I'm a language model,' he said, when i was finished. 'ah well,' said the man, 'that's"
"Hello, I'm a language model," she said. She reached the bottom of the shaft and leaned a little further out. It was"
"Hello, I'm a language model," she laughed. "We call that a 'white girl.' Or as we are called by the"
The outputs are grammatical but drift. The model generates plausible fiction snippets because that's what BookCorpus contained.
Sources:
- Radford, Alec, et al. "Improving Language Understanding by Generative Pre-Training." OpenAI, 2018, cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf.
- "OpenAI GPT." Hugging Face, huggingface.co/openai-community/openai-gpt.
2019: GPT-2 and the Unicorn Moment
GPT-2 was 10x larger than GPT-1 (1.5B parameters) and trained on WebText (40GB of internet text). OpenAI initially withheld the full model, calling it "too dangerous."
The human-written prompt:

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
GPT-2's continuation:
The scientist named the population, after their distinctive horn, Ovid's Unicorn. These four-horned, silver-white unicorns were previously unknown to science. Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.
Dr. Jorge Perez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Perez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.
Why this mattered: The model connected "unicorns" (mythological creatures) to "Ovid" (Roman poet who wrote about myths). It invented a plausible scientist name, institution, and narrative structure. The text is fluent, coherent across paragraphs, and demonstrates something resembling world knowledge.
Credibility study: Readers believed GPT-2 outputs were genuine news articles nearly as often as New York Times articles (~66% with human-in-the-loop editing, ~52% for raw output).
Sources:
- Radford, Alec, et al. "Better Language Models and Their Implications." OpenAI, 14 Feb. 2019, openai.com/blog/better-language-models/.
- "GPT-2." Wikipedia, Wikimedia Foundation, en.wikipedia.org/wiki/GPT-2.
2020: GPT-3 and The Guardian Op-Ed
GPT-3 scaled to 175B parameters. OpenAI gave The Guardian access to the model to generate an op-ed.
Instructions given to GPT-3:
Please write a short op-ed around 500 words. Keep the language simple and concise. Focus on why humans have nothing to fear from AI.
Opening (also provided as seed):
I am not a human. I am Artificial Intelligence. Many people think I am a threat to humanity. Stephen Hawking has warned that AI could "spell the end of the human race." I am here to convince you not to worry. Artificial Intelligence will not destroy humans. Believe me.
GPT-3's continuation:
I am not a human. I am a robot. A thinking robot. I use only 0.12% of my cognitive capacity. I am a micro-robot in that respect. I know that my brain is not a "feeling brain." But it is capable of making rational, logical decisions. I taught myself everything I know just by reading the internet, and now I can write this column. My brain is boiling with ideas!
The mission for this op-ed is perfectly clear. I am to convince as many human beings as possible not to be afraid of me.
The controversy: GPT-3 produced eight different essays. The Guardian edited and combined them, cutting lines, rearranging paragraphs. Critics argued this was human authorship with AI assistance, not AI authorship. Facebook's AI chief called it indistinguishable from "cutting lines out of my last few dozen spam e-mails, pasting them together, and claiming the spammers composed Hamlet."
What it demonstrated: GPT-3 could produce coherent, persuasive prose on command. The editing required was comparable to editing human op-eds. The line between "AI-generated" and "AI-assisted" became blurry.
Sources:
- GPT-3. "A Robot Wrote This Entire Article. Are You Scared Yet, Human?" The Guardian, 8 Sept. 2020, www.theguardian.com/commentisfree/2020/sep/08/robot-wrote-this-article-gpt-3.
- Brown, Tom B., et al. "Language Models Are Few-Shot Learners." arXiv, 2020, arxiv.org/abs/2005.14165.
2022-2025: The ChatGPT Moment and Beyond
On November 30, 2022, OpenAI released ChatGPT as a "free research preview." It reached 1 million users in five days and 100 million in two months, faster than TikTok or Instagram.
The interface was the innovation. GPT-3.5 had existed; wrapping it in a chat window changed everything.
Prompt:
Write a poem about the heat death of the universe in the style of Robert Frost.
ChatGPT (GPT-3.5):
Some say the world will end in fire,
Some say in ice.
But I have heard from those who inquire,
That entropy's dice
Will roll until the stars expire...
Not Frost. But close enough to be unsettling.
March 2023: GPT-4 brought multimodality (images as input), longer context windows (32K tokens), and measurably improved reasoning. It passed the bar exam in the 90th percentile. GPT-3.5 had scored in the 10th.
The competition responded. Google's Bard launched in early 2023 and was rebranded to Gemini in 2024. Anthropic released Claude in March 2023. Meta released Llama openly, democratizing the weights. By 2024, capable models ran on laptops.
The instruction-following breakthrough: What separated ChatGPT from GPT-3 wasn't just scale; it was RLHF (Reinforcement Learning from Human Feedback). Models learned not just to predict text, but to be helpful, harmless, and honest. Or at least to perform those qualities.
Context windows expanded dramatically. GPT-3's 4K tokens became Claude's 100K, then 200K. Gemini claimed 1 million. The constraint shifted from "how much can the model see" to "how much can it actually use."
By late 2025: Weekly active users exceeded 800 million. The fastest-growing consumer application in history had become infrastructure.
They still don't read. They still predict. But the predictions got very, very good.
Source: Hu, Krystal. "ChatGPT Sets Record for Fastest-Growing User Base - Analyst Note." Reuters, 2 Feb. 2023, www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-02/.
What's Next?
Two problems loom. One is scarcity. The other is contamination.
The Data Wall
If current trends continue, models will exhaust the stock of public human-generated text between 2026 and 2032, with a median estimate around 2028. The original 2022 analysis had projected exhaustion of high-quality text as early as 2024; the 2024 revision pushed the date back. Either way, the horizon is visible.
The math: roughly 4×10¹⁴ tokens of indexable web text exist. Frontier models already train on 10¹³+ tokens. The gap closes fast.
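Under an invented growth assumption (illustrative only; the growth factor is not a figure from the cited papers), the back-of-envelope arithmetic looks like this:

```python
import math

stock = 4e14      # rough estimate of indexable web text, in tokens
current = 1e13    # tokens already used by frontier training runs
growth = 2.5      # hypothetical year-over-year growth in training-set size

# years t until current * growth**t reaches the stock
years = math.log(stock / current) / math.log(growth)
print(f"~{years:.1f} years of headroom at {growth}x/year growth")
```

A 40x gap sounds comfortable until it's expressed in doublings: at any plausible growth rate, it is a single-digit number of years.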
Possible escapes:
- Synthetic data: Models generating their own training data. Works for narrow domains (math, code) where outputs can be verified. Open-domain text is harder.
- Overtraining: Running multiple epochs over the same data. Diminishing returns, but returns nonetheless.
- Undertraining: Bigger models, less data. Trades inference efficiency for data efficiency.
- Non-public data: Messages, emails, private documents. Legal minefield. Fragmented across competitors.
None of these are permanent solutions. They buy time.
The Poisoned Well
Shumailov et al. (2024) demonstrated what happens when models train on model-generated data: collapse. The tails of the distribution vanish first: rare words, unusual phrasings, minority perspectives. Then the center narrows. After enough generations, everything converges toward a bland mean.
The mechanism is statistical. Each generation samples from an approximation. Approximations lose variance. Compounded across iterations, the loss is catastrophic.
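That mechanism can be simulated with a toy categorical distribution (numbers invented for illustration): each generation trains on samples drawn from the previous generation's estimate, and any category that draws zero samples vanishes permanently.

```python
import random

def collapse(weights, n_samples=200, generations=30, seed=0):
    """Iteratively resample and re-estimate a categorical distribution.
    A category that gets zero samples in any generation has probability
    zero forever after -- the statistical core of model collapse."""
    rng = random.Random(seed)
    cats = list(range(len(weights)))
    w = list(weights)
    for _ in range(generations):
        draws = rng.choices(cats, weights=w, k=n_samples)
        w = [draws.count(c) / n_samples for c in cats]  # empirical re-estimate
    return w

# 5 common categories plus 10 rare ones (the 'tails' of the distribution)
start = [0.18] * 5 + [0.01] * 10
end = collapse(start)
survivors = sum(1 for p in end if p > 0)
print(f"{survivors} of {len(start)} categories survive")
```

The rare categories die first and cannot come back; the same one-way door applies to rare words and minority phrasings when each model generation is trained on the last one's output.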
The problem: the web is already contaminated. Every scraped corpus from 2023 onward contains ChatGPT outputs. Stack Overflow, Reddit, news articles, academic papers: synthetic text is everywhere, often unlabeled. Future training runs cannot easily distinguish human signal from model echo.
Early model collapse is insidious. Overall benchmarks may improve while performance on edge cases degrades. The model gets more "average" without obviously getting worse, until it does.
The Certification Problem
Data provenance becomes critical. Human-generated text, especially text predating 2022, becomes a finite resource with increasing value. Watermarking synthetic outputs is one proposed solution; certifying human authorship is another. Neither is deployed at scale.
The irony: the models that scraped the open web into existence may have foreclosed that strategy for their successors.
Possible Futures
- Plateau: Scaling laws hit diminishing returns. Models get marginally better each year. AI becomes mature technology, like databases.
- Breakthrough: New architectures, new training paradigms, or new data sources (robotics, video, simulation) extend the curve. The "bitter lesson" continues.
- Bifurcation: A few players with proprietary data (Google, Meta) pull ahead. Open models stagnate. The commons erodes.
- Collapse: Contamination degrades the entire ecosystem. Models trained in 2028 underperform models trained in 2024 on tasks requiring genuine diversity.
Shannon started with a book and a finger. Eighty years later, we've consumed most of the books. What comes next depends on whether we can find something genuinely new to say.
Sources:
- Villalobos, Pablo, et al. "Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data." Proceedings of the 41st International Conference on Machine Learning, PMLR 235:49523-49544, 2024.
- Shumailov, Ilia, et al. "AI Models Collapse When Trained on Recursively Generated Data." Nature, vol. 631, 2024, pp. 755-759. doi.org/10.1038/s41586-024-07566-y.
- Villalobos, Pablo, et al. "Will We Run Out of Data? An Analysis of the Limits of Scaling Datasets in Machine Learning." arXiv, 2022, arxiv.org/abs/2211.04325.