When Writing Changes Voice, Statistics Listen
How a 2001 method for comparing corpora became a detector for AI-generated text, pasted content, and ghostwriters.
Here's something odd about human writing: it's supposed to be inconsistent.
Not wrong-inconsistent. Naturally inconsistent. Your word choices drift as you fatigue. Your sentence rhythms shift as topics evolve. The "you" writing at 2 AM after three rewrites sounds different from the "you" who started at 9 AM with fresh coffee and dangerous optimism.
This inconsistency leaves a fingerprint. And in 2001, a computational linguist named Adam Kilgarriff gave us the mathematics to measure it.
When the Voice Shifts
Consider a 10,000-word document. Single byline. Professional formatting. But somewhere around page 7, something changes. The vocabulary shifts. The rhythm stutters. The confident assertions give way to hedged qualifications.
What happened?
Maybe nothing. Maybe the author was tired. Maybe they circled back after a week and couldn't quite find the voice again.
Or maybe pages 7-12 were written by someone else entirely. A ghostwriter. An AI. Content pasted from another source. The kind of thing that matters enormously in academic integrity, legal forensics, and increasingly, in distinguishing human creativity from machine generation.
The question isn't "who wrote this?" It's subtler: "Did the same voice write all of this?"
Adam Kilgarriff (1960-2015)
Kilgarriff was a computational linguist at the University of Brighton and later Lexical Computing Ltd, where he created the Sketch Engine, a corpus analysis tool used by lexicographers worldwide. His 2001 paper "Comparing Corpora" introduced the chi-squared method that underlies this work.
He passed away unexpectedly in 2015, leaving behind foundational contributions to corpus linguistics, word sense disambiguation, and the infrastructure that powers modern dictionary-making. His work continues to influence how we understand language at scale.
Kilgarriff's insight was deceptively simple: if two texts come from the same author (or the same "population" of language), their word frequency distributions should be statistically similar. If they're different, the chi-squared test will catch it.
He wasn't thinking about AI detection. He was thinking about corpora, large collections of text that lexicographers use to understand how words actually behave in the wild. His method could tell you whether two newspaper archives came from the same publication tradition, or whether a collection of Renaissance plays showed consistent authorial style.
The math he chose, chi-squared, had been around since Karl Pearson proposed it in 1900. Kilgarriff's contribution was showing how to apply it to word frequencies in a way that worked for texts of unequal length and that identified which words drove the difference.
The Original Algorithm
Kilgarriff's method compares two texts by asking: "If these texts came from the same underlying language distribution, how surprised should we be by the word frequency differences we observe?"
The algorithm:
- Combine both texts into a joint corpus. This establishes a baseline of what "normal" looks like for these texts combined.
- Extract the N most frequent words. Common words (the, and, of, to) carry the most stylistic signal. Kilgarriff recommended 500.
- For each word, compute expected vs. observed frequencies. If text A is 60% of the joint corpus, we'd expect 60% of each word's occurrences to appear in text A.
- Apply the chi-squared formula. Sum up the squared differences between observed and expected, normalized by expected.
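The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not Kilgarriff's reference implementation: tokenization is naive whitespace splitting, and the function name is my own.

```python
from collections import Counter

def kilgarriff_chi2(text_a, text_b, top_n=500):
    """Kilgarriff-style chi-squared comparison of two texts."""
    tokens_a = text_a.lower().split()
    tokens_b = text_b.lower().split()

    # Step 1: the joint corpus establishes the baseline.
    joint = Counter(tokens_a) + Counter(tokens_b)
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)

    # Text A's share of the combined corpus.
    share_a = len(tokens_a) / (len(tokens_a) + len(tokens_b))

    chi2 = 0.0
    # Step 2: only the N most frequent words of the joint corpus.
    for word, total in joint.most_common(top_n):
        # Step 3: expected counts are proportional to each text's share.
        expected_a = total * share_a
        expected_b = total * (1 - share_a)
        # Step 4: sum of (observed - expected)^2 / expected, both sides.
        chi2 += (counts_a[word] - expected_a) ** 2 / expected_a
        chi2 += (counts_b[word] - expected_b) ** 2 / expected_b
    return chi2
```

Identical texts score exactly zero; the score grows as the frequency profiles diverge.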
Lower chi-squared means more similar. Higher means more different. The resulting number isn't meaningful in isolation. It's meaningful in comparison. Text A vs. B gives χ² = 45. Text A vs. C gives χ² = 312. A and B are more alike.
What makes this useful for stylometry is what chi-squared captures: not just vocabulary overlap, but proportional usage patterns. Two authors might both use "however." But one uses it once per 500 words; another once per 2,000. Kilgarriff's method catches that.
Borrowing the Lens
Kilgarriff designed his method to compare separate texts. But what if you don't have two texts? What if you have one document and you want to know whether it's internally consistent?
The adaptation is straightforward: turn one document into many.
Take a long document, say 10,000 words. Slide a window across it, extracting chunks of 1,000 words each. Move the window by 500 words at a time (50% overlap), creating a sequence of overlapping samples:
Window 1: words 0-999
Window 2: words 500-1499
Window 3: words 1000-1999
Window 4: words 1500-2499
...
Now compare adjacent windows using Kilgarriff's chi-squared. Window 1 vs. 2. Window 2 vs. 3. Window 3 vs. 4. Plot the results.
What you get is a drift curve, a time series of stylistic distance measurements across the document. And that curve tells stories.
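The windowing step can be sketched as follows. The comparison function here is a compact restatement of the chi-squared method described earlier, working on token lists; the names and defaults are illustrative.

```python
from collections import Counter

def chi2_distance(tokens_a, tokens_b, top_n=500):
    """Kilgarriff chi-squared over the joint corpus's top-N words."""
    joint = Counter(tokens_a) + Counter(tokens_b)
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    share_a = len(tokens_a) / (len(tokens_a) + len(tokens_b))
    score = 0.0
    for word, total in joint.most_common(top_n):
        exp_a, exp_b = total * share_a, total * (1 - share_a)
        score += (ca[word] - exp_a) ** 2 / exp_a
        score += (cb[word] - exp_b) ** 2 / exp_b
    return score

def drift_curve(tokens, window=1000, stride=500):
    """Chi-squared distance between each pair of adjacent windows."""
    # Slide a window across the token stream with 50% overlap by default.
    windows = [tokens[i:i + window]
               for i in range(0, len(tokens) - window + 1, stride)]
    # Compare each window to its neighbor: window 1 vs 2, 2 vs 3, ...
    return [chi2_distance(windows[i], windows[i + 1])
            for i in range(len(windows) - 1)]
```

A 10,000-token document with these defaults yields 19 windows and an 18-point drift curve; each point is one stylistic-distance measurement.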
What the Patterns Mean
Analyze the chi-squared curves from many documents (human-written, AI-generated, collaborative, heavily edited) and four distinct patterns emerge.
Here's what the writing process actually looks like: you stare at a blank page, write something mediocre, hate it, rewrite it, hate it less, keep going anyway. As Joakim Book observes, writers "gravitate between thinking that all they write is nonsense and that every word is golden." The finished product is a "mumbling mess of half-baked sentences" that somehow coheres through sheer persistence.
This is not a bug. This is how human writing happens. You must accept imperfection to write at all. And that acceptance leaves traces: the vocabulary shift when you came back after coffee, the rhythm change when you finally understood what you were trying to say, the slight inconsistency that proves a human wrestled with the words.
AI doesn't wrestle. It generates. And generation without struggle produces text that is too consistent, too smooth, too perfect. The absence of drift is itself the tell.
Five Fingerprints
Theory is one thing. Data is another. To test whether these four patterns actually emerge in practice, we ran Kilgarriff's method across five texts: four novels by human authors spanning two centuries of English prose, and one 20,000-word essay generated by ChatGPT.
Each text was divided into 1,000-word windows with 50% overlap. Each adjacent pair of windows was compared using the chi-squared formula above. The result: a distribution of drift measurements per author, visualized as box plots.
The "All Five" view shows the scale problem immediately. ChatGPT's chi-squared values cluster around 18. The human authors cluster between 280 and 450. This is the "Suspiciously Uniform" pattern in action: the AI generates text where every window is statistically interchangeable with the next.
The "Human Authors Only" view reveals something subtler. All four human authors (Austen, Brontë, Dickens, and Tolkien) share nearly identical coefficients of variation: 0.071 to 0.079. Despite writing in different centuries, different genres, and different narrative modes, they all vary their function-word distributions by roughly 7-8% around their personal mean. The differences between them are in position (where the box sits), not in spread (how tall the box is).
This is the "Consistent" pattern. Not uniform. Not erratic. Consistently variable, within a bandwidth that appears to be a property of sustained human prose.
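The coefficient of variation quoted above is just the standard deviation of the drift curve divided by its mean, which makes it comparable across authors whose curves sit at different absolute levels. A one-line sketch:

```python
import statistics

def coefficient_of_variation(drift_values):
    """Spread of a drift curve relative to its mean.

    In the experiment described above, human prose clustered near
    0.07-0.08 on this measure; AI-generated text sat far lower.
    """
    return statistics.stdev(drift_values) / statistics.mean(drift_values)
```

Dividing by the mean is the point: Dickens's boxes sit higher than Austen's on the raw chi-squared axis, but their relative spreads land in the same 7-8% band.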
The Individual Stories
The box plots show where each author sits. But each text has its own drift curve, its own narrative, its own moments where the statistics spike or flatten. These case studies walk through the details.
The Implementation
The drift detection algorithm builds on Kilgarriff's core method with several practical additions:
| Component | Purpose |
|---|---|
| Sliding windows | Creates overlapping chunks for smooth temporal resolution |
| Trend detection | Linear regression identifies gradual drift patterns |
| Spike detection | Statistical outlier identification for discontinuities |
| Variance analysis | Coefficient of variation catches AI-like uniformity |
| Confidence scoring | Degrades gracefully when data is marginal |
The key parameters:
- Window size (default: 1,000 tokens): Larger windows give more stable chi-squared but fewer comparisons. Smaller windows give finer resolution but noisier measurements.
- Stride (default: 500 tokens): How far to advance between windows. Stride equal to window size gives non-overlapping chunks. Stride at half window size gives 50% overlap and smoother curves.
- Top N words (default: 500): How many high-frequency words to include. Kilgarriff's original recommendation. More words means finer discrimination but requires longer texts.
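Two of the components from the table, spike detection and variance analysis, can be sketched on top of a drift curve. The thresholds below (a z-score cutoff of 2.5, a CV floor of 0.04) are illustrative assumptions, not the tool's actual defaults:

```python
import statistics

def find_spikes(drift_values, z_threshold=2.5):
    """Flag comparisons whose chi-squared is a statistical outlier
    relative to the rest of the curve (a possible paste or voice change)."""
    mean = statistics.mean(drift_values)
    stdev = statistics.stdev(drift_values)
    if stdev == 0:
        return []
    return [i for i, v in enumerate(drift_values)
            if (v - mean) / stdev > z_threshold]

def looks_suspiciously_uniform(drift_values, cv_floor=0.04):
    """AI-like uniformity: coefficient of variation well below the
    0.07-0.08 band observed for sustained human prose."""
    cv = statistics.stdev(drift_values) / statistics.mean(drift_values)
    return cv < cv_floor
```

Both checks read the same curve from opposite ends: a spike says "something changed here," while a flat curve says "nothing ever changed, which is its own red flag."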
What This Catches (And What It Doesn't)
The drift detector excels at identifying:
- Multi-author documents: When someone else wrote a section, the style shift is measurable.
- Pasted content: Content copied from another source disrupts the stylistic continuity.
- AI-generated text: The uncanny uniformity of LLM output stands out against human variation.
- Heavy editing asymmetry: When parts of a document were revised more than others.
It struggles with:
- Short texts: Chi-squared needs volume. Documents under 3,000 words rarely produce enough windows for reliable analysis.
- Intentional style shifting: A novelist who deliberately changes voice for different characters will trigger false positives.
- Perfectly consistent human writers: They exist, occasionally. The method might flag them as suspiciously uniform.
- Sophisticated AI that mimics variance: As models improve at simulating human inconsistency, this signal will weaken.
The Deeper Point
Kilgarriff built his method to answer a question about corpora: are these collections linguistically related? The adaptation for drift detection answers a different question: is this document linguistically coherent?
But both questions share an insight: writing carries signatures that statistics can read.
We don't always know we're leaving these traces. The slight shift in vocabulary when we're tired. The rhythm change when we pivot topics. The eerie consistency when a machine does our thinking for us. These patterns exist below conscious awareness, in the aggregate statistics of thousands of word choices.
Adam Kilgarriff gave us the mathematics to surface them. He was thinking about dictionaries and corpora, the practical infrastructure of understanding language at scale. He didn't anticipate a world where distinguishing human from machine writing would become urgent.
But the tools he built turn out to be exactly what that world needs.
References
Kilgarriff, Adam. "Comparing Corpora." International Journal of Corpus Linguistics, vol. 6, no. 1, 2001, pp. 97-133. doi: 10.1075/ijcl.6.1.05kil
Eder, Maciej. "Does Size Matter? Authorship Attribution, Small Samples, Big Problem." Digital Scholarship in the Humanities, vol. 30, no. 2, 2015, pp. 167-182.
Juola, Patrick. "Authorship Attribution." Foundations and Trends in Information Retrieval, vol. 1, no. 3, 2006, pp. 233-334.
Programming Historian. "Introduction to Stylometry with Python." programminghistorian.org
Pearson, Karl. "On the Criterion That a Given System of Deviations from the Probable in the Case of a Correlated System of Variables Is Such That It Can Be Reasonably Supposed to Have Arisen from Random Sampling." The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, vol. 50, no. 302, 1900, pp. 157-175.