
When Writing Changes Voice, Statistics Listen

How a 2001 method for comparing corpora became a detector for AI-generated text, pasted content, and ghostwriters.

Here's something odd about human writing: it's supposed to be inconsistent.

Not wrong-inconsistent. Naturally inconsistent. Your word choices drift as you fatigue. Your sentence rhythms shift as topics evolve. The "you" writing at 2 AM after three rewrites sounds different from the "you" who started at 9 AM with fresh coffee and dangerous optimism.

This inconsistency leaves a fingerprint. And in 2001, a computational linguist named Adam Kilgarriff gave us the mathematics to measure it.

[Illustration: A man in a suit examining a giant manuscript on the floor with a magnifying glass, discovering fingerprint-like swirl patterns embedded in the text.]
Somewhere around page 7, the prints changed.

When the Voice Shifts

Consider a 10,000-word document. Single byline. Professional formatting. But somewhere around page 7, something changes. The vocabulary shifts. The rhythm stutters. The confident assertions give way to hedged qualifications.

What happened?

Maybe nothing. Maybe the author was tired. Maybe they circled back after a week and couldn't quite find the voice again.

Or maybe pages 7-12 were written by someone else entirely. A ghostwriter. An AI. Content pasted from another source. The kind of thing that matters enormously in academic integrity, legal forensics, and increasingly, in distinguishing human creativity from machine generation.

The question isn't "who wrote this?" It's subtler: "Did the same voice write all of this?"

Adam Kilgarriff (1960-2015)

Kilgarriff was a computational linguist at the University of Brighton and later Lexical Computing Ltd, where he created the Sketch Engine, a corpus analysis tool used by lexicographers worldwide. His 2001 paper "Comparing Corpora" introduced the chi-squared method that underlies this work.

He passed away unexpectedly in 2015, leaving behind foundational contributions to corpus linguistics, word sense disambiguation, and the infrastructure that powers modern dictionary-making. His work continues to influence how we understand language at scale.

Kilgarriff's insight was deceptively simple: if two texts come from the same author (or the same "population" of language), their word frequency distributions should be statistically similar. If they're different, the chi-squared test will catch it.

He wasn't thinking about AI detection. He was thinking about corpora, large collections of text that lexicographers use to understand how words actually behave in the wild. His method could tell you whether two newspaper archives came from the same publication tradition, or whether a collection of Renaissance plays showed consistent authorial style.

The math he chose, chi-squared, had been around since Karl Pearson proposed it in 1900. Kilgarriff's contribution was showing how to apply it to word frequencies in a way that worked for texts of unequal length and that identified which words drove the difference.

The Original Algorithm

Kilgarriff's method compares two texts by asking: "If these texts came from the same underlying language distribution, how surprised should we be by the word frequency differences we observe?"

The algorithm:

  1. Combine both texts into a joint corpus. This establishes a baseline of what "normal" looks like for these texts combined.
  2. Extract the N most frequent words. Common words (the, and, of, to) carry the most stylistic signal. Kilgarriff recommended 500.
  3. For each word, compute expected vs. observed frequencies. If text A is 60% of the joint corpus, we'd expect 60% of each word's occurrences to appear in text A.
  4. Apply the chi-squared formula. Sum up the squared differences between observed and expected, normalized by expected.
χ² = Σ (O - E)² / E
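
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not Kilgarriff's own code; the function name `kilgarriff_chi2` and its signature are assumptions made for this sketch.

```python
from collections import Counter

def kilgarriff_chi2(tokens_a, tokens_b, n_words=500):
    """Compare two token lists; lower chi-squared means more similar."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    joint = counts_a + counts_b                       # step 1: joint corpus
    top = [w for w, _ in joint.most_common(n_words)]  # step 2: top-N words
    share_a = len(tokens_a) / (len(tokens_a) + len(tokens_b))
    chi2 = 0.0
    for w in top:
        exp_a = joint[w] * share_a                    # step 3: expected counts
        exp_b = joint[w] * (1 - share_a)
        chi2 += (counts_a[w] - exp_a) ** 2 / exp_a    # step 4: chi-squared sum
        chi2 += (counts_b[w] - exp_b) ** 2 / exp_b
    return chi2
```

Identical texts score exactly zero; the more the proportional word usage diverges, the larger the score grows.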

Lower chi-squared means more similar. Higher means more different. The resulting number isn't meaningful in isolation. It's meaningful in comparison. Text A vs. B gives χ² = 45. Text A vs. C gives χ² = 312. A and B are more alike.

What makes this useful for stylometry is what chi-squared captures: not just vocabulary overlap, but proportional usage patterns. Two authors might both use "however." But one uses it once per 500 words; another once per 2,000. Kilgarriff's method catches that.

Borrowing the Lens

Kilgarriff designed his method to compare separate texts. But what if you don't have two texts? What if you have one document and you want to know whether it's internally consistent?

The adaptation is straightforward: turn one document into many.

Take a long document, say 10,000 words. Slide a window across it, extracting chunks of 1,000 words each. Move the window by 500 words at a time (50% overlap), creating a sequence of overlapping samples:

Window 1:  words 0-999
Window 2:  words 500-1499
Window 3:  words 1000-1999
Window 4:  words 1500-2499
...
        

Now compare adjacent windows using Kilgarriff's chi-squared. Window 1 vs. 2. Window 2 vs. 3. Window 3 vs. 4. Plot the results.
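
The windowing scheme can be sketched as follows. The helper names (`chi2`, `drift_curve`) are illustrative assumptions; the chi-squared helper simply restates the formula from the previous section.

```python
from collections import Counter

def chi2(a, b, n_words=500):
    ca, cb = Counter(a), Counter(b)
    joint = ca + cb
    share = len(a) / (len(a) + len(b))       # text a's share of the joint corpus
    score = 0.0
    for w, n in joint.most_common(n_words):  # top-N words of the joint corpus
        ea, eb = n * share, n * (1 - share)  # expected counts per text
        score += (ca[w] - ea) ** 2 / ea + (cb[w] - eb) ** 2 / eb
    return score

def drift_curve(tokens, window=1000, step=500):
    """Chi-squared distance between each pair of adjacent overlapping windows."""
    starts = range(0, len(tokens) - window + 1, step)
    windows = [tokens[i:i + window] for i in starts]
    return [chi2(w1, w2) for w1, w2 in zip(windows, windows[1:])]
```

A 10,000-word document with these defaults yields 19 windows and an 18-point drift curve.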

What you get is a drift curve, a time series of stylistic distance measurements across the document. And that curve tells stories.

What the Patterns Mean

Analyzing the chi-squared curves from many documents (human-written, AI-generated, collaborative, heavily edited) reveals four distinct patterns.

The greatest stylistic drift is the lack of stylistic drift.

Here's what the writing process actually looks like: you stare at a blank page, write something mediocre, hate it, rewrite it, hate it less, keep going anyway. As Joakim Book observes, writers "gravitate between thinking that all they write is nonsense and that every word is golden." The finished product is a "mumbling mess of half-baked sentences" that somehow coheres through sheer persistence.

This is not a bug. This is how human writing happens. You must accept imperfection to write at all. And that acceptance leaves traces: the vocabulary shift when you came back after coffee, the rhythm change when you finally understood what you were trying to say, the slight inconsistency that proves a human wrestled with the words.

AI doesn't wrestle. It generates. And generation without struggle produces text that is too consistent, too smooth, too perfect. The absence of drift is itself the tell.

Five Fingerprints

Theory is one thing. Data is another. To test whether these four patterns actually emerge in practice, we ran Kilgarriff's method across five texts: four novels by human authors spanning two centuries of English prose, and one 20,000-word essay generated by ChatGPT.

Each text was divided into 1,000-word windows with 50% overlap. Each adjacent pair of windows was compared using the chi-squared formula above. The result: a distribution of drift measurements per author, visualized as box plots.

[Interactive: Toggle between all five authors and human-only view. Hover for details.]

The "All Five" view shows the scale problem immediately. ChatGPT's chi-squared values cluster around 18. The human authors cluster between 280 and 450. This is the "Suspiciously Uniform" pattern in action: the AI generates text where every window is statistically interchangeable with the next.

The "Human Authors Only" view reveals something subtler. All four human authors, Austen, Brontë, Dickens, and Tolkien, share nearly identical coefficients of variation: 0.071 to 0.079. Despite writing in different centuries, different genres, and different narrative modes, they all vary their function-word distributions by roughly 7-8% around their personal mean. The differences between them are in position (where the box sits), not in spread (how tall the box is).

This is the "Consistent" pattern. Not uniform. Not erratic. Consistently variable, within a bandwidth that appears to be a property of sustained human prose.
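
One way to operationalize this bandwidth is a coefficient-of-variation check on the drift series. The sketch below is an illustration, not the study's actual code: the 0.03 cutoff is an invented demonstration value, not a calibrated threshold.

```python
import statistics

def uniformity_flag(drift_values, cv_threshold=0.03):
    """Return (CV, flag): flag is True when the drift series is
    suspiciously uniform relative to the threshold."""
    mean = statistics.fmean(drift_values)
    cv = statistics.pstdev(drift_values) / mean  # coefficient of variation
    return cv, cv < cv_threshold
```

Human prose in the data above sits near CV 0.071 to 0.079; a series hugging its mean far more tightly than that is the "Suspiciously Uniform" signature.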

The Individual Stories

The box plots show where each author sits. But each text has its own drift curve, its own narrative, its own moments where the statistics spike or flatten. These case studies walk through the details.

[Interactive: Browse case studies and explore each author's drift signature in detail.]

The Implementation

The drift detection algorithm builds on Kilgarriff's core method with several practical additions:

Sliding windows: creates overlapping chunks for smooth temporal resolution.
Trend detection: linear regression identifies gradual drift patterns.
Spike detection: statistical outlier identification for discontinuities.
Variance analysis: coefficient of variation catches AI-like uniformity.
Confidence scoring: degrades gracefully when data is marginal.
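
The trend and spike components can be sketched like this. It is an illustrative reconstruction, not the detector's actual implementation; the function names and the z = 2.0 cutoff are assumptions.

```python
import statistics

def trend_slope(values):
    """Least-squares slope of the drift series over window index:
    a sustained nonzero slope indicates gradual stylistic drift."""
    n = len(values)
    mx, my = (n - 1) / 2, statistics.fmean(values)
    num = sum((x - mx) * (y - my) for x, y in enumerate(values))
    den = sum((x - mx) ** 2 for x in range(n))
    return num / den

def spikes(values, z=2.0):
    """Indices where the drift score is a statistical outlier,
    i.e. a sharp discontinuity between adjacent windows."""
    mean, sd = statistics.fmean(values), statistics.pstdev(values)
    return [i for i, v in enumerate(values) if sd and abs(v - mean) / sd > z]
```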

The key parameters: a window size of 1,000 words, a step of 500 words (50% overlap), and the 500 most frequent words per pairwise comparison.

What This Catches (And What It Doesn't)

The drift detector excels at identifying:

  - AI-generated passages, whose statistical uniformity stands out against human variability
  - Content pasted from another source partway through a document
  - Ghostwritten sections, where the voice changes but the byline does not

It struggles with:

  - Short documents, which yield too few windows for reliable statistics (the small-sample problem Eder describes)

The Deeper Point

Kilgarriff built his method to answer a question about corpora: are these collections linguistically related? The adaptation for drift detection answers a different question: is this document linguistically coherent?

But both questions share an insight: writing carries signatures that statistics can read.

We don't always know we're leaving these traces. The slight shift in vocabulary when we're tired. The rhythm change when we pivot topics. The eerie consistency when a machine does our thinking for us. These patterns exist below conscious awareness, in the aggregate statistics of thousands of word choices.

Adam Kilgarriff gave us the mathematics to surface them. He was thinking about dictionaries and corpora, the practical infrastructure of understanding language at scale. He didn't anticipate a world where distinguishing human from machine writing would become urgent.

[Illustration: A child staring up at shelves of toys, with unique handcrafted animals on top and identical robots filling the lower shelves.]
Human writing is the top shelf.

But the tools he built turn out to be exactly what that world needs.


References

Kilgarriff, Adam. "Comparing Corpora." International Journal of Corpus Linguistics, vol. 6, no. 1, 2001, pp. 97-133. doi: 10.1075/ijcl.6.1.05kil

Eder, Maciej. "Does Size Matter? Authorship Attribution, Small Samples, Big Problem." Digital Scholarship in the Humanities, vol. 30, no. 2, 2015, pp. 167-182.

Juola, Patrick. "Authorship Attribution." Foundations and Trends in Information Retrieval, vol. 1, no. 3, 2006, pp. 233-334.

Programming Historian. "Introduction to Stylometry with Python." programminghistorian.org

Pearson, Karl. "On the Criterion That a Given System of Deviations from the Probable in the Case of a Correlated System of Variables Is Such That It Can Be Reasonably Supposed to Have Arisen from Random Sampling." The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, vol. 50, no. 302, 1900, pp. 157-175.