
Every Word Has a Price Tag

The word "the" should appear about 6,181 times in every 100,000 words of English. When it doesn't, something interesting is happening.

Open any novel. Count the word "however." Now count the words in the whole book.

Illustration of a figure examining a document with letters streaming from the page
Expected frequency: 14 per 100,000. Observed: 47. Case closed.

You'll find that "however" appears roughly 14 times per 100,000 words. Not 13. Not 16. Roughly 14. This holds across genres, across decades, across publishers. The British National Corpus, a 100-million-word snapshot of real English from the early 1990s, established this and thousands of other baselines.

When a word appears more or less often than this expectation, it tells you something. Usually something the author didn't intend to reveal.

What "Expected" Means

The British National Corpus (BNC) was compiled between 1991 and 1994. It contains 100 million words drawn from newspapers, fiction, academic papers, spoken transcripts, and personal letters. The goal was to capture what "normal English" looks like when you stop arguing about what normal means and just measure it.

From this corpus, you can extract the relative frequency of any word. The word "the" accounts for about 6.18% of all tokens. The word "moor" accounts for about 0.0008%. These numbers aren't prescriptive. They're descriptive. They're what happens when you observe real English at scale.

The interesting question is what you can do with them.

Given a relative frequency, you can compute an expected count for any text of any length:

import bnc_lookup as bnc

# How many times should "the" appear in a 50,000-word text?
bnc.expected_count("the", 50000)
# → 3090.7

# How many times should "moor" appear?
bnc.expected_count("moor", 50000)
# → 0.4

In a 50,000-word text, general English predicts "the" about 3,091 times and "moor" roughly zero times. If you're reading The Hound of the Baskervilles, "moor" appears dozens of times.

That gap between expected and observed is the entire story.

Seeing Deviation

Consider this sentence:

The moor stretched away in a great sweep of undulating ground, broken by the jagged outlines of the granite tors.

Every word in that sentence has a BNC expected frequency. Most of them ("the," "in," "a," "of," "by") appear roughly as often as you'd expect. They're the scaffolding of English. Invisible because ubiquitous.

But "moor," "undulating," "jagged," "granite," "tors": these are rare. In general English, you encounter "tors" about once per million words. In this Conan Doyle passage, it appears once in 18 words. That's an overuse ratio of roughly 55x.
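
That ratio is just observed count divided by expected count. A minimal sketch of the arithmetic, using a caller-supplied per-million rate rather than the real bnc-lookup data (the helper name is hypothetical):

```python
def overuse_ratio(observed, total_words, per_million_rate):
    """How many times more often a word appears than a reference rate predicts."""
    expected = per_million_rate / 1_000_000 * total_words
    return observed / expected

# A word with a reference rate of 1 per million, seen twice in 500,000 words,
# is running at 4x its expected rate.
```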

If you color each word by its ratio, overused words glow warm, underused words glow cool, and expected words stay transparent, the passage lights up. Domain-specific vocabulary becomes immediately visible. The landscape of the moor emerges from the landscape of frequency.

The interactive heatmap demo does exactly this. Select a passage, and every word becomes a colored span. Red means the word appears far more than general English would predict. Blue means far less. Hover for the raw numbers.

What Overuse Reveals

An author's word frequency profile is a kind of fingerprint. Some of it is deliberate: Conan Doyle writes about Dartmoor, so "moor," "hound," "baronet," "granite" appear because the subject demands them. That's not leakage. That's content.

But the profile reveals more than subject matter. Doyle also underuses "she," "woman," "children." The Sherlock Holmes stories are overwhelmingly male worlds. He didn't set out to signal that. The frequency data surfaces it without any semantic analysis. Just counting.

Melville writing about whaling overuses "whale," "sea," "captain," "harpoon." Again, expected. But he also overuses "white." Across Moby-Dick, "white" appears at roughly 4x the BNC expected rate. The obsession is visible in the arithmetic, and it sits in a different category from the whaling vocabulary. "Whale" is a topic choice. "White" at that rate is something closer to a fixation.

Technical writing is even more striking. A machine learning paper will overuse "model," "training," "attention," "gradient," "loss" at 10x to 50x the general English baseline. The domain vocabulary doesn't just appear. It dominates.

The Neutral Zone

Equally revealing is what stays neutral. Function words ("the," "of," "and," "in," "to," "is") are remarkably stable. They appear at near-BNC rates regardless of author, genre, or century. They're the skeleton of English, invariant to subject matter.

This stability is what makes the overuse and underuse meaningful. If everything varied wildly, deviation would mean nothing. The neutral words provide the baseline that makes the colorful words informative.

The Stylometry Connection

Stylometry, the quantitative analysis of writing style, has relied on word frequency since the 1960s. Mosteller and Wallace used function-word frequencies to settle the authorship of disputed Federalist Papers. The technique was simple: count how often each author uses "upon" vs. "on," "whilst" vs. "while," "enough" vs. "sufficient."

These choices are largely unconscious. You don't decide to use "upon" 3.2 times per 1,000 words. You just do. And that rate is remarkably consistent across your writing, even as your topics change.

The BNC provides the reference distribution that makes this work at scale. Instead of comparing two authors against each other, you can compare any author against the corpus. The question shifts from "Does Author A differ from Author B?" to "How does this text differ from general English?"

This is more powerful than it sounds. A single text, with no comparison sample, can reveal its subject domain, its author's unconscious fixations, and a distinctive stylistic signature, all from comparing observed counts against a single reference distribution.

Linguistic Fingerprinting

The forensic application follows directly. If every author carries a distinctive frequency profile, then the profile can identify the author. This is linguistic fingerprinting, and it works because the signal lives in the words authors don't think about.

Content words are unreliable identifiers. A mystery writer uses "murder" because the genre demands it, not because of personal style. But the rate at which that same writer uses "however," "upon," "rather," or "quite" is genuinely idiosyncratic. These function-word frequencies persist across topics, across years, across conscious attempts to disguise one's style. They are the writer's involuntary signature.

Forensic linguists have used exactly this technique in court. The Unabomber case turned partly on linguistic analysis of Ted Kaczynski's manifesto. Authorship disputes over anonymous op-eds, disputed wills, and threatening letters have all been resolved by comparing frequency profiles against known writing samples. The BNC baseline makes this quantitative rather than impressionistic: instead of an expert saying "this feels like the same author," you can measure the statistical distance between two frequency distributions and assign a confidence level.

The same principle scales to literary scholarship. When a previously unknown manuscript surfaces, or when a collaborative work needs its contributions disentangled, frequency analysis provides evidence that doesn't depend on subjective reading. The words give the author away, one ratio at a time.

This cuts both ways. A professional novelist has a legitimate frequency profile, shaped by decades of deliberate craft. When forensic analysis identifies their style, it's confirming authorship, not exposing deception. But when a student submits work with a frequency profile that doesn't match their previous writing, or when a text carries the telltale flatness of AI-generated prose, the same technique flags an identity mismatch. The math doesn't distinguish between "interesting" and "suspicious." It just measures the gap.

What the BNC Doesn't Know

The BNC was compiled in the early 1990s. It contains none of the vocabulary that arrived after 1994: "emoji," "podcast," "blockchain," "selfie," "vlog." The word "cloud" still refers to weather. Thirty years of linguistic drift are simply absent.

This is both a limitation and a feature.

As a limitation: modern jargon will always register as "not in BNC." The heatmap grays out words the corpus never saw. For contemporary technical writing, a significant fraction of the vocabulary may fall into this category.

As a feature: the BNC represents a stable, well-documented baseline that doesn't shift with fashion. The word "however" appeared 14 times per 100,000 words in 1993, and it still appears at roughly that rate. Function words are even more stable. The core of English doesn't move fast.

A frozen corpus is a fixed ruler. It measures the same thing every time you use it. That consistency is valuable for comparison, even if the ruler was made thirty years ago.

From Counting to Code

The bnc-lookup Python library makes these calculations trivial. Four functions cover most use cases:

import bnc_lookup as bnc

# Does this word exist in the BNC?
bnc.exists("moor")               # True
bnc.exists("blockchain")         # False

# What frequency bucket? (1 = most common, 100 = rarest)
bnc.bucket("the")               # 1
bnc.bucket("moor")              # 2
bnc.bucket("blockchain")        # None (not in BNC)

# Exact relative frequency (occurrences per word)
bnc.relative_frequency("the")   # 0.0618
bnc.relative_frequency("moor")  # 0.000008

# Expected occurrences in a text of N words
bnc.expected_count("the", 10000)   # 618.1
bnc.expected_count("moor", 10000)  # 0.08

The library bundles the full BNC frequency data as pre-compiled Python modules, eliminating the need for a database, file I/O, or external downloads. Lookups run in microseconds through MD5-based hash indexing against frozen sets. The entire corpus, 669,417 unique word forms, loads lazily as needed.

This is a deliberate design choice. Corpus linguistics tools traditionally require downloading large data files, configuring paths, and managing database connections. By embedding the data directly into the Python package, the activation energy drops to zero: pip install bnc-lookup and you're counting.

In an era where the default developer instinct is to route every question through an LLM API call, there's something to be said for localized intelligence. A hash lookup against a frozen set runs in O(1). No network round-trip. No token budget. No rate limit. No latency. The answer to "how common is this word in English?" doesn't require a neural network. It requires a dictionary compiled thirty years ago and a division operation.

Building the Heatmap

The word frequency heatmap is a straightforward application of expected_count(). For each word in a passage:

  1. Count how many times it appears (observed count)
  2. Compute expected_count(word, total_words)
  3. Divide observed by expected to get the ratio
  4. Map the ratio to a color on a diverging scale
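
The four steps above can be sketched in a few lines. The tiny rate table here is a stand-in assumption; a real implementation would call bnc.relative_frequency() instead:

```python
import math
from collections import Counter

# Stand-in reference rates (illustrative values, not real BNC data)
REF_RATE = {"the": 0.0618, "of": 0.0294, "moor": 0.000008}

def heatmap_scores(tokens):
    """Map each word to log2(observed / expected); None for words outside the reference.
    Zero means "as expected"; positive is overuse, negative is underuse."""
    total = len(tokens)
    counts = Counter(tokens)
    scores = {}
    for word, observed in counts.items():
        rate = REF_RATE.get(word)
        if rate is None:
            scores[word] = None  # unknown to the reference: gray it out
        else:
            scores[word] = math.log2(observed / (rate * total))
    return scores
```

Mapping the log score to a diverging color scale is then a straight interpolation: zero sits at transparent, positive values shade warm, negative values shade cool.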
BNC vocabulary zones by frequency bucket, from Core Vocabulary at bucket 1 through Corpus Noise at bucket 100
The BNC's 669,417 word forms fall into distinct vocabulary zones. Core vocabulary (buckets 1–10) accounts for 62.5% of all text despite containing relatively few unique words. The long tail of specialized, rare, and noise terms fills the remaining buckets.

The color mapping uses a logarithmic scale. This matters because frequency ratios span orders of magnitude. "Tors" at 55x overuse and "the" at 1.0x expected can't share a linear scale without one of them becoming invisible. Log scaling gives equal visual weight to 2x overuse and 0.5x underuse.

The result is a paragraph where your eye is drawn to the words that deviate. The common scaffolding disappears. The interesting words remain.

Beyond the Heatmap

Word frequency deviation is a primitive. Simple to compute, simple to visualize. But it opens doors to more complex analyses.

Drift detection. Split a text into overlapping windows. Compute the frequency profile of each window. Compare adjacent windows using chi-squared distance. When the distance spikes, the style changed. This is how you detect ghostwriting, AI-generated insertions, or editorial interventions. The pystylometry drift demo implements exactly this, and expected_count() provides the baseline for the chi-squared computation.
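
A minimal sketch of the window comparison, assuming pre-tokenized windows (this is the chi-squared arithmetic, not the pystylometry API):

```python
from collections import Counter

def chi_squared_distance(window_a, window_b):
    """Chi-squared distance between the word-frequency profiles of two token windows."""
    ca, cb = Counter(window_a), Counter(window_b)
    na, nb = len(window_a), len(window_b)
    dist = 0.0
    for word in set(ca) | set(cb):
        fa, fb = ca[word] / na, cb[word] / nb
        dist += (fa - fb) ** 2 / (fa + fb)  # fa + fb > 0 for every word in the union
    return dist
```

When this distance spikes between adjacent windows, the frequency profile, and likely the authorial voice, has shifted.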

Author profiling. Compute the overuse/underuse profile for an entire body of work. Which words does this author consistently overuse relative to general English? The answer is surprisingly stable across their works and surprisingly distinctive between authors.
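
A profile sketch under the same assumptions, with an illustrative rate table standing in for the BNC data:

```python
from collections import Counter

# Illustrative reference rates, not real BNC values
REF_RATE = {"whale": 0.00001, "white": 0.00035, "the": 0.0618}

def top_overused(tokens, n=2):
    """Words ranked by observed/expected ratio against the reference rates."""
    total = len(tokens)
    counts = Counter(t for t in tokens if t in REF_RATE)
    ratios = {w: (c / total) / REF_RATE[w] for w, c in counts.items()}
    return sorted(ratios, key=ratios.get, reverse=True)[:n]
```

Run over a whole corpus of one author's work, the head of this ranking is remarkably stable from book to book.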

AI detection. Large language models produce text with a characteristic frequency profile. They tend to avoid extreme overuse and extreme underuse, resulting in a suspiciously flat deviation landscape. Human writing is lumpier. The BNC baseline makes this measurable.
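
One way to quantify that lumpiness is the spread of log-ratios across a text. A sketch with stand-in rates (the helper is illustrative, not a production detector):

```python
import math
from collections import Counter
from statistics import pstdev

REF_RATE = {"the": 0.0618, "of": 0.0294, "and": 0.0268}  # illustrative, not real BNC data

def deviation_spread(tokens):
    """Population stdev of log2(observed/expected) over words found in the reference.
    A small spread means a flat deviation landscape; human text tends to run higher."""
    total = len(tokens)
    counts = Counter(tokens)
    log_ratios = [math.log2((n / total) / REF_RATE[w])
                  for w, n in counts.items() if w in REF_RATE]
    return pstdev(log_ratios) if len(log_ratios) > 1 else 0.0
```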

Readability analysis. Texts that heavily overuse low-frequency BNC words are harder to read. Not because rare words are inherently difficult, but because density of unfamiliar vocabulary increases cognitive load. The frequency profile predicts readability without any readability formula.

A Hundred Million Words Ago

The BNC was created before the web consumed English. Before texting compressed it. Before social media flattened register distinctions. It captures a version of English that was already vanishing when the corpus was sealed.

This makes it a time capsule. Not just of words, but of their frequencies, their relationships to each other, their relative commonness in a world that communicated differently.

When you compare modern text against the BNC, you're measuring distance from that world. "Tweet" and "hashtag" register as unknown. "Shall" and "whom" register as underused because modern English uses them less than 1993 English did. The corpus is a fixed point in a moving language.

That fixed point matters more now than it did a decade ago. As AI-generated text saturates the web, language models trained on their own output begin to flatten the very distributions they learned from. Early model collapse is insidious: overall benchmarks may improve while performance on edge cases quietly degrades, the model getting more "average" without obviously getting worse.

A stack of books in a grand columned hall
The corpus doesn't move. The language does.

Pre-2020 corpora like the BNC become more valuable as baselines for human language patterns, not less. They capture what English looked like before generative models began smoothing its edges.

. . .

References

  1. BNC Consortium. (2007). "The British National Corpus, version 3 (BNC XML Edition)." Distributed by Bodleian Libraries, University of Oxford.
  2. Mosteller, F. and Wallace, D. L. (1964). "Inference and Disputed Authorship: The Federalist." Addison-Wesley.
  3. Kilgarriff, A. (2001). "Comparing Corpora." International Journal of Corpus Linguistics, 6(1).
  4. Craig Trim. (2025). "bnc-lookup: Fast BNC word frequency lookups for Python." PyPI.
  5. Craig Trim. (2025). "BNC Word Frequency Heatmap." Interactive demo.

Appendix: Modern Corpora

The BNC was sealed in the early 1990s. Several larger and more recent corpora now exist for researchers who need contemporary baselines.

BNC2014 is the direct successor: 100 million words of present-day British English, built by Lancaster University with the same design as the original, plus an "E-language" section for digital text. The closest apples-to-apples upgrade.

COCA (Corpus of Contemporary American English) is the largest freely accessible genre-balanced corpus: over one billion words spanning spoken language, fiction, magazines, newspapers, academic texts, TV and film subtitles, blogs, and web pages. For most practical purposes, COCA is the modern default.

NOW (News on the Web) contains over 16 billion words from online newspapers and magazines, 2010 to the present, updated daily. Massive, but skewed toward news register.

iWeb offers 14 billion words from web pages across six countries (2017). Large but not genre-balanced. GloWbE provides 1.9 billion words across twenty English-speaking countries, useful for comparing regional varieties.