Words Learning the Company They Keep

Firth said you know a word by the company it keeps. Early embeddings disagreed, giving 'bank' the same vector regardless of whether it kept company with 'river' or 'robbery.'

Early embeddings (Word2Vec, GloVe, FastText) operated on a simple principle: one word, one vector.

During training, the model sees millions of sentences containing "fork." Some involve dinner tables, some involve GitHub repositories, some involve chess positions, some involve roads diverging. The optimization process finds a single point in high-dimensional space that minimizes prediction error across all these contexts.

The result: "fork" lands at a geometric compromise. Not quite in the dining cluster. Not quite in the git cluster. Not quite in the chess cluster. Somewhere in between, equidistant from all its meanings.

This is the centroid problem. The vector represents the average of all usages, which corresponds to no actual usage.
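The centroid effect is easy to reproduce with toy numbers. In the sketch below, each sense of "oracle" gets its own axis; the coordinates are illustrative stand-ins, not learned embeddings:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors: each meaning of "oracle" on its own axis.
# Purely illustrative coordinates, not learned embeddings.
senses = {
    "ancient_prophecy":    np.array([1.0, 0.0, 0.0, 0.0]),
    "enterprise_software": np.array([0.0, 1.0, 0.0, 0.0]),
    "matrix_universe":     np.array([0.0, 0.0, 1.0, 0.0]),
    "dc_superhero":        np.array([0.0, 0.0, 0.0, 1.0]),
}

# A static embedding that minimizes error across all contexts
# is driven toward the mean of its usage contexts.
oracle = np.mean(list(senses.values()), axis=0)

for name, v in senses.items():
    print(f"{name:20s} cosine = {cosine(oracle, v):.2f}")
# Every similarity comes out identical (0.50 here): the centroid is
# moderately close to all senses and faithful to none.
```

The numbers are contrived, but the geometry is the point: averaging distinct regions produces a vector that sits in none of them.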

Large visualization showing clustered semantic regions with oracle and fork as polysemous centroids positioned between their relevant clusters
Interactive: Oracle: priestess, database, prophet, superhero. Pick one. (The model can't.)

The training corpus contains sentences like:

    The priestess served as oracle at the temple of Delphi.
    We migrated the Oracle database to a new SQL server.
    In The Matrix, Neo seeks out the Oracle for guidance.
    In Gotham, Batgirl reinvented herself as Oracle.

The model has no semantic labels. It sees only word co-occurrence patterns. "Oracle" appears near "priestess" and "temple" sometimes. Near "database" and "SQL" other times. Near "Neo" and "Matrix" in another set. Near "Batgirl" and "Gotham" in yet another.

Gradient descent does what it always does: minimize loss. It finds coordinates for "oracle" that make all four contexts reasonably predictable, even though no single position makes any of them optimally predictable.

In a simplified projection, the resulting vector's nearest neighbors might look like this:

Nearby word      | Cosine similarity
·················|···················
prophecy         | 0.61
database         | 0.58
Morpheus         | 0.54
Gotham           | 0.49

None of these similarities are wrong. None are precise either. The embedding encodes a blurred superposition of four distinct concepts, recoverable only if you already know which meaning you want.

When Context Meant Selection

It is strange how quickly "before LLMs" became "ancient history." The field operated under a fundamental constraint: either encode human knowledge explicitly through semantic resources and heuristics, or encode it implicitly through labeled examples.

You could build precise rule-based systems leveraging carefully curated semantic resources like WordNet, accepting that they would shatter against edge cases and resist scaling. Or you could train statistical classifiers, one task at a time, each demanding months of annotation labor to produce ground truth labels that might not generalize beyond your corpus. Both paths led somewhere useful. Neither led anywhere fast.

Sketch-style illustration showing a figure pointing at numbered doors with an audience watching, metaphor for word sense disambiguation
Behind each door: a different meaning of "oracle"

Word sense disambiguation (WSD) exemplified this tradeoff.

The dominant approach was enumeration. WordNet (Miller, 1995) maintains discrete sense entries, each with its own synset containing synonyms, hypernym chains, and definitional glosses. For "oracle," the inventory would need entries corresponding to the four clusters above:

oracle.n.01 (ancient_prophecy)
    Gloss: a priestess or shrine through which a deity reveals
           hidden knowledge or prophecy
    Hypernym: prophet -> person
    See also: Delphi, divination, pythia

oracle.n.02 (enterprise_software)
    Gloss: Oracle Corporation, an American multinational computer
           technology company specializing in database software
    Hypernym: company -> organization
    See also: database, SQL, Ellison

oracle.n.03 (matrix_universe)
    Gloss: a character in The Matrix film franchise who provides
           cryptic guidance to human resistance fighters
    Hypernym: fictional_character -> entity
    See also: Neo, Morpheus, prophecy

oracle.n.04 (dc_superhero)
    Gloss: the alias adopted by Barbara Gordon after her paralysis,
           serving as an information broker for Gotham's vigilantes
    Hypernym: fictional_character -> entity
    See also: Batgirl, Batman, hacker

The representation is symbolic, not geometric. No centroid problem arises because senses never merge into a shared vector space.

Word Sense Disambiguation (WSD)

Given a sentence, the task becomes selecting the correct synset. Classical approaches developed along several lines.

Lesk's algorithm (Lesk, 1986) compared dictionary definitions against context words through lexical overlap. "Oracle database migration" shares more tokens with the enterprise_software gloss than with the ancient_prophecy gloss. Simplified Lesk variants (Kilgarriff & Rosenzweig, 2000) improved efficiency by comparing context directly against glosses without computing pairwise sentence similarity.
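The overlap idea fits in a few lines. In this sketch, the glosses are paraphrased from the inventory above, and the tokenizer and stopword list are deliberately crude:

```python
# A minimal simplified-Lesk sketch: score each candidate sense by
# lexical overlap between the context and the sense's gloss.
# Glosses paraphrased from the inventory above; stopword list ad hoc.

STOPWORDS = {"a", "an", "the", "or", "of", "to", "in", "who", "for"}

GLOSSES = {
    "ancient_prophecy":    "a priestess or shrine through which a deity "
                           "reveals hidden knowledge or prophecy",
    "enterprise_software": "an American computer technology company "
                           "specializing in database software",
    "matrix_universe":     "a character in The Matrix film franchise "
                           "who provides cryptic guidance",
    "dc_superhero":        "the alias of Barbara Gordon, an information "
                           "broker for Gotham's vigilantes",
}

def tokens(text):
    # Lowercase, strip edge punctuation, drop stopwords.
    return {w.strip(".,'\"").lower() for w in text.split()} - STOPWORDS

def simplified_lesk(context, glosses=GLOSSES):
    ctx = tokens(context)
    # Pick the sense whose gloss shares the most tokens with the context.
    return max(glosses, key=lambda s: len(ctx & tokens(glosses[s])))

print(simplified_lesk("oracle database migration for the company servers"))
# -> enterprise_software
```

Real Lesk implementations add stemming, weighting, and extended glosses from related synsets, but the scoring principle is exactly this set intersection.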

Supervised classification trained on sense-annotated corpora. SemCor (Miller et al., 1993) provided hand-labeled WordNet senses for Brown Corpus texts. Later, OntoNotes (Hovy et al., 2006) offered coarser-grained distinctions more suitable for NLP applications. Features included bag-of-words context windows, part-of-speech tags, and syntactic dependency relations. Accuracy on standard benchmarks plateaued at around 70-75% (Navigli, 2009), a respectable but brittle performance when confronting domain shift.

Knowledge-based methods exploited WordNet's graph structure. Algorithms like Personalized PageRank (Agirre & Soroa, 2009) treat sense selection as a random walk problem: if "SQL" and "server" appear nearby, which synset for "oracle" has the shortest path to database-related concepts? The UKB system demonstrated competitive performance without requiring labeled training data.
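A toy version of the graph intuition, substituting BFS hop counts over an invented concept graph for Personalized PageRank (node names and edges are made up for illustration):

```python
from collections import deque

# Invented mini concept graph; a stand-in for WordNet's sense graph.
GRAPH = {
    "oracle.n.01": ["prophecy", "priestess"],
    "oracle.n.02": ["database", "company"],
    "prophecy": ["oracle.n.01", "divination"],
    "priestess": ["oracle.n.01", "temple"],
    "database": ["oracle.n.02", "sql"],
    "company": ["oracle.n.02"],
    "divination": ["prophecy"],
    "temple": ["priestess"],
    "sql": ["database", "server"],
    "server": ["sql"],
}

def distance(graph, start, goal):
    """BFS hop count between two nodes; None if unreachable."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, d = frontier.popleft()
        if node == goal:
            return d
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return None

def pick_sense(context_words, senses=("oracle.n.01", "oracle.n.02")):
    BIG = 10**6  # penalty for unreachable context words
    def score(sense):
        total = 0
        for w in context_words:
            d = distance(GRAPH, sense, w)
            total += d if d is not None else BIG
        return total
    return min(senses, key=score)  # fewest total hops wins

print(pick_sense(["sql", "server"]))  # -> oracle.n.02
```

Random walks over the full WordNet graph generalize this "closest sense" heuristic: instead of raw hop counts, PageRank mass seeded at the context words flows toward the best-connected synset.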

Selectional preference models (Resnik, 1997) captured verb-argument plausibility. "Query" selects for database-like objects; "consult" selects for human-like entities. This provided soft constraints on sense selection based on syntactic role.

Critical Limitations

The approach does not scale gracefully. WordNet 3.0 contains roughly 117,000 synsets covering around 155,000 word forms. English, meanwhile, resists enumeration. The Oxford English Dictionary catalogues over 170,000 words in current use; technical and scientific vocabularies push estimates toward a million or more. WordNet's coverage, while impressive as a research artifact, represents a fraction of the living language.

Even within that fraction, sense granularity is inconsistent. Some words have 30+ senses reflecting fine lexicographic distinctions, while others conflate obviously distinct usages. The Matrix's Oracle and Barbara Gordon's Oracle would both collapse into something like "person who delivers guidance," despite sharing no other semantic properties. They occupy entirely separate fictional universes with different narrative functions.

Then there is accuracy. The best WSD systems of the era achieved 70-75% on standard benchmarks (Navigli, 2009). Impressive for a research task, catastrophic at scale. Process a million words, and you have 250,000 or more disambiguation errors propagating downstream. Supplement WordNet with domain ontologies, mine definitions from Wikipedia, and add heuristics for named entities. You can push the numbers. You cannot escape them.

More fundamentally, discrete sense inventories assume meanings are enumerable. They are not. Novel metaphors, domain-specific jargon, and productive polysemy resist cataloguing. When "Google" shifted from proper noun to verb, no WordNet committee could anticipate the semantic space it would come to occupy.

I wrote an article nearly 11 years ago that discussed the use of WordNet polysemy as a method of re-ranking synonyms for word variability.

Screenshot of the WordNet Search 3.1 interface showing a search for the term graft with its various noun and verb senses
Polysemy in 2015

Those were challenging times for instructing a computer in the full use of natural language.

Attention as Implicit Disambiguation

The transformer's solution is elegant in its indirection: stop trying to enumerate senses. Stop trying to classify into buckets. Instead, compute what a word means right now, in this sentence, from these neighbors.

Static embeddings asked: "What single point in space best represents all usages of 'oracle'?"

Zoomed view of vector space showing oracle at center with four semantic clusters: ancient_prophecy, enterprise_software, matrix_universe, and dc_superhero
"oracle" is a highly polysemous term

Transformers ask: "Given that 'oracle' appears between 'ancient prophecy' and 'Neo,' where should its vector land this time?"

The vector doesn't sit at a static centroid. It moves through semantic space as each neighboring token's contribution accumulates. "Ancient" and "prophecy" pull it toward Delphi. "Convinced" tugs slightly toward enterprise software. Then "Neo" arrives, a proper noun with exactly one dominant association, and the vector snaps into the Matrix cluster.

This is learned behavior, not programmed rules. No engineer specified that "Neo" should override "ancient prophecy." The model discovered, through gradient descent on prediction tasks, that "Neo" in proximity to "oracle" overwhelmingly predicts Matrix-related continuations. The attention weights reflect corpus statistics transformed into geometric operations.

Try it yourself:

Context-to-Region Sankey diagram demo
Interactive: Edge thickness is corpus co-occurrence probability.

Disambiguation emerges from geometry. The four meanings of "oracle" aren't four discrete entries in a dictionary; they're four regions of vector space. Context determines which region the computed representation occupies. Different sentences yield different vectors, all starting from the same embedding but diverging through attention.

Try this yourself:

Attention in Action demo
Interactive: Attention computes destination from context. Add words and watch the oracle vector leave its compromise position and commit to a meaning.

The shift from enumeration to computation represents more than a change in technique. It reflects a different ontological stance toward meaning itself. Static embeddings and WordNet both assume that word senses exist as stable entities waiting to be discovered or catalogued. Transformers make no such assumption. Meaning becomes a function, not a fact. The same word in different contexts isn't assigned to different pre-existing categories; it simply computes to different vectors.

The Mechanics of Contextual Computation

Attention operates through three learned projections: queries, keys, and values. Each token's embedding is transformed into these three vectors through separate weight matrices. The query asks: "What information do I need?" The key advertises: "Here is what I contain." The value provides: "Here is what I contribute."

For a target word like "oracle," the process unfolds as follows. The model projects the oracle embedding into a query vector. Every other token in the sequence projects into key vectors. The dot product between the oracle query and each context key measures relevance. High dot products indicate tokens whose information should influence the oracle representation. Low dot products indicate tokens to ignore.

These raw scores undergo softmax normalization, converting them into a probability distribution. The result is a set of attention weights summing to one. Each context token then contributes its value vector, scaled by its attention weight. The oracle's new representation is this weighted sum: a blend of value vectors from tokens deemed relevant by the query-key matching.
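The computation just described fits in a few lines of NumPy. The token embeddings and projection matrices below are random stand-ins for learned parameters, so the weights are meaningless; only the mechanics are real:

```python
import numpy as np

# Single-head scaled dot-product attention, forward pass only.
# Embeddings and weights are random stand-ins for learned parameters.
rng = np.random.default_rng(0)
d_model, d_k = 8, 4
tokens = ["the", "oracle", "told", "neo"]
X = rng.normal(size=(len(tokens), d_model))   # one embedding per token

W_q = rng.normal(size=(d_model, d_k))         # query projection
W_k = rng.normal(size=(d_model, d_k))         # key projection
W_v = rng.normal(size=(d_model, d_k))         # value projection

Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Relevance: every query dotted with every key, scaled by sqrt(d_k).
scores = Q @ K.T / np.sqrt(d_k)

# Softmax turns each row of scores into weights summing to one.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Each token's new representation is a weighted sum of value vectors.
out = weights @ V

print(weights[1].round(3))  # how much "oracle" attends to each token
```

With trained weights, row 1 of that matrix is where disambiguation lives: a "Neo"-heavy row pulls the oracle representation toward the Matrix region.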

The critical insight is that all three projections are learned. The model doesn't know a priori that "Neo" should strongly influence "oracle." It discovers this through training. When "oracle" appears near "Neo" in the corpus, predicting the next token (perhaps "said" or "explained") requires understanding that this is Matrix-oracle, not database-oracle. Backpropagation adjusts the projection matrices so that Neo's key aligns with oracle's query in these contexts. Over billions of examples, the geometry encodes disambiguation as a byproduct of next-token prediction.

This is a hypothetical Q-K-V conversation.

Try it yourself:

Q-K-V Tokens in Conversation demo
Interactive: a simplified Q-K-V dialog for a single attention head.

The dialog simplifies. In reality, "oracle" doesn't ask one question; it asks twelve (or however many heads the model uses). Each head poses a different query. One might ask about syntactic role: "Am I the subject or object here?" Another asks about the semantic domain: "What field am I in?" A third asks about narrative context: "Who are the characters nearby?" Each query elicits different responses from the same neighbors. Neo's key might strongly match the semantic-domain query while weakly matching the syntactic query. The final representation blends all these conversations, weighted by what proved helpful during training.
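Sketched in code, multi-head attention is the same computation run in parallel subspaces and then concatenated. As before, the weights here are random stand-ins:

```python
import numpy as np

# Multi-head attention sketch: n_heads independent attention passes
# over d_model/n_heads-dimensional subspaces, concatenated at the end.
rng = np.random.default_rng(1)
n_heads, d_model = 4, 16
d_head = d_model // n_heads
X = rng.normal(size=(5, d_model))   # 5 tokens, random stand-in embeddings

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for h in range(n_heads):
    # Each head has its own projections, hence its own "question".
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(d_head))
    heads.append(A @ V)

# Concatenate the per-head answers back into one vector per token.
out = np.concatenate(heads, axis=-1)
print(out.shape)  # (5, 16): original width restored
```

Production models add an output projection after the concatenation, which lets the heads' answers mix; the subspace-split structure is the part sketched here.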

Layers add depth to the conversation. The dialog you just saw represents a single round of the game. But transformers stack these rounds. After the first exchange, every token has updated its representation based on its neighbors. In the second round, "oracle" computes its queries not from the original embeddings but from the already-contextualized vectors. It's asking tokens that have already listened to their neighbors. By layer twelve, the conversation has propagated information across the entire sequence. The "oracle" vector no longer reflects just its immediate neighbors; it reflects patterns that emerged from their neighbors' neighbors, recursively integrated.
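The stacking itself is just a loop. In this sketch, residual connections are kept while layer norm and feed-forward sublayers are omitted, and each layer gets its own random stand-in weights:

```python
import numpy as np

# Layer stacking: attention applied repeatedly, so round two queries
# already-contextualized vectors. Residuals kept; layer norm and
# feed-forward sublayers omitted; weights are random stand-ins.
rng = np.random.default_rng(2)
d = 8
X = rng.normal(size=(4, d))   # 4 tokens

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_layer(X, rng, d):
    # Small weights keep the toy residual stream numerically tame.
    Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(d)) @ V

for layer in range(12):
    # Each round mixes in information gathered during earlier rounds.
    X = X + attention_layer(X, rng, d)

print(X.shape)  # still (4, 8): same tokens, progressively contextualized
```

The shapes never change; what changes is how much of the sequence each vector has absorbed by the time the final layer emits it.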

From Cartography to Navigation

Firth's intuition was linguistic, not computational: meaning arises from context, not from dictionary entries. For decades, NLP tried to honor this insight while working around it. WordNet catalogued senses. WSD systems selected among them. The context was consulted, but only to choose which pre-existing meaning applied. The map preceded the territory.

Transformers invert this. There is no map. Each forward pass computes a new position in semantic space, determined entirely by the tokens present. "Oracle" doesn't retrieve a sense; it calculates a vector. The calculation is shaped by training, which itself was shaped by billions of contexts in which words kept various kinds of company. Firth's dictum becomes architecture.

Whether this constitutes understanding remains contested. The model has no grounding in Greek religious practices, corporate database licensing, or the Wachowskis' philosophical influences. It knows only that certain words predict certain other words in specific configurations. Yet from those patterns alone, it derives representations that separate meanings humans recognize as distinct. The gap between correlation and comprehension persists. The practical utility does not depend on resolving it.

Satirical dinner party illustration with wide-eyed figures, a take on Attention Is All You Need
Attention is all you need. Whether you wanted it or not.

What has changed is the nature of the problem. Polysemy is no longer a classification task requiring human-curated inventories. It is a geometric consequence of learned attention. The word finds its meaning by finding its neighbors, exactly as Firth suggested. The company it keeps is now computed, not consulted.

