The Hidden Geography of Language
How words turn into coordinates, and why "king minus man plus woman" equals "queen".
In 1957, linguist J.R. Firth wrote: "You shall know a word by the company it keeps."
Sixty years later, that sentence became the theoretical foundation for how every major language model understands text.
GPT-4, Claude, LLaMA, Mistral: they all represent words as points in high-dimensional space. Similar words cluster together. Related concepts form neighborhoods. And the distance between two points encodes how semantically related they are.
This is the story of how meaning became geometry.
The Problem with Words
Computers don't understand words. They understand numbers.
When you type "sky" into a language model, the machine doesn't have some internal notion of "the atmosphere visible from Earth." It has a sequence of bytes: 0x73, 0x6B, 0x79. That's it.
Early NLP systems tried to bridge this gap with hand-crafted rules. "Sky" belongs to the category NATURE. It relates to WEATHER. It's a noun. But these symbolic approaches couldn't scale. Human annotators couldn't possibly enumerate every relationship between every word.
The breakthrough came from a different direction entirely: let the computer learn what words mean by watching how they're used.
Firth's insight was simple but profound: words that appear in similar contexts have similar meanings.
Consider these sentences:
- "looked up at the sky."
- "Birds flew across the sky."
- "The sky turned dark."
Now consider these:
- "looked up at the ceiling."
- "birds flew across the room" (doesn't quite work)
- "the ceiling turned... dark?" (unusual but parseable)
"Sky" and "ceiling" share some contexts ("looked up at the ___") but diverge in others. They're related but not identical. A system that tracks these patterns could learn that "sky" and "ceiling" are both "things above you" but differ in scale and setting.
This is precisely what word embeddings are designed to do.
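Before moving to learned vectors, the context-overlap idea can be checked by simple counting. A minimal sketch using the toy sentences above and a crude set-overlap score (nothing a production system would use, but it makes the pattern concrete):

```python
from collections import Counter

sentences = [
    "looked up at the sky".split(),
    "birds flew across the sky".split(),
    "the sky turned dark".split(),
    "looked up at the ceiling".split(),
    "the ceiling turned dark".split(),
]

def context_counts(word, window=2):
    """Count the words appearing within `window` positions of `word`."""
    counts = Counter()
    for sent in sentences:
        for i, w in enumerate(sent):
            if w != word:
                continue
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    counts[sent[j]] += 1
    return counts

def context_overlap(w1, w2):
    """Jaccard overlap of the two words' context sets: a crude similarity."""
    a, b = set(context_counts(w1)), set(context_counts(w2))
    return len(a & b) / len(a | b)
```

On this tiny corpus, "sky" and "ceiling" share most of their contexts ("at", "the", "turned", "dark"), while "sky" and "birds" share almost none. Embeddings compress exactly this kind of statistic into coordinates.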
Words as Vectors
A word embedding is a list of numbers that represents the meaning of a word. Each number corresponds to a dimension, and the collection of numbers forms a point in space.
"sky" โ [0.234, -0.891, 0.127, 0.445, ..., -0.332] (768 dimensions) "atmosphere" โ [0.198, -0.845, 0.089, 0.512, ..., -0.298] "basement" โ [-0.445, 0.123, -0.667, -0.234, ..., 0.891]
These numbers aren't arbitrary. They emerge from training: the model adjusts vectors to minimize prediction error on massive text corpora. Words that appear in similar contexts develop similar vectors.
Why 768 dimensions? Why not 3, or 50, or 10,000?
Too few dimensions: The space gets crowded. Unrelated words collide. "Bank" (financial) and "bank" (river) need room to separate based on context, but there's no space for nuance.
Too many dimensions: Diminishing returns. Each new dimension adds parameters to learn, but the marginal semantic information decreases. Training becomes slower without proportional benefit.
The sweet spot depends on vocabulary size and the amount of training data.
Modern large language models use:
Model           Embedding Dimensions
-------------------------------------
BERT-base       768
GPT-2           768
GPT-3           12,288
LLaMA 2 (7B)    4,096
LLaMA 2 (70B)   8,192
Larger models can afford more dimensions because they have more parameters to learn the relationships.
Here's what surprises most people: individual dimensions don't have human-interpretable meanings.
There's no "noun-ness" dimension. No "positive sentiment" axis. The model discovers whatever structure helps it predict text. Some of that structure aligns with human categories. Much of it doesn't.
Researchers have probed embeddings and found that specific directions encode properties like:
- Gender (male/female)
- Number (singular/plural)
- Tense (past/present/future)
- Formality (casual/formal)
But these are directions, not individual dimensions. The encoding is distributed across the whole vector.
Distance as Meaning
Once words are vectors, we can measure the distance between them. And that distance encodes semantic similarity.
The straight-line distance between two points:
distance(sky, atmosphere) = sqrt( (0.234 - 0.198)² + (-0.891 - (-0.845))² + (0.127 - 0.089)² + ... )
Small distance = similar meaning. Large distance = different meaning.
In practice, cosine similarity is more common than Euclidean distance. It measures the angle between two vectors, ignoring their magnitudes:
cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)
Two vectors pointing in the same direction have similarity = 1.
Perpendicular vectors have similarity = 0.
Opposite vectors have similarity = -1.
Why cosine over Euclidean? Word vectors can have different magnitudes based on word frequency and training dynamics. Cosine ignores this, focusing purely on the direction of meaning.
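Both measures are a few lines of NumPy. A sketch, using only the first three dimensions of the illustrative vectors above (real comparisons would use all dimensions):

```python
import numpy as np

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return float(np.linalg.norm(a - b))

def cosine_similarity(a, b):
    """Cosine of the angle: 1 = same direction, 0 = perpendicular, -1 = opposite."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# First three dimensions of the illustrative vectors from earlier
sky = np.array([0.234, -0.891, 0.127])
atmosphere = np.array([0.198, -0.845, 0.089])
basement = np.array([-0.445, 0.123, -0.667])

print(cosine_similarity(sky, atmosphere))  # close to 1: nearly parallel
print(cosine_similarity(sky, basement))    # negative: roughly opposite directions
```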
Let's compute distances between words (using a hypothetical 2D projection for illustration):
Word         X    Y
---------------------
sky          290  200
atmosphere   330  170
ceiling      580  270
basement     620  530
Euclidean distances from "sky":
sky → atmosphere: √[(330-290)² + (170-200)²] = 50
sky → ceiling:    √[(580-290)² + (270-200)²] ≈ 298
sky → basement:   √[(620-290)² + (530-200)²] ≈ 467
The numbers match intuition: "sky" is closest to "atmosphere" (both refer to the open air above), moderately distant from "ceiling" (a surface above you, but indoors), and far from "basement" (enclosed, underground).
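These distances are easy to recompute from the coordinate table (toy 2D points only, rounded to the nearest integer):

```python
import math

# Toy 2D coordinates from the table above
coords = {
    "sky": (290, 200),
    "atmosphere": (330, 170),
    "ceiling": (580, 270),
    "basement": (620, 530),
}

def dist(w1, w2):
    """Euclidean distance between two words' 2D points."""
    (x1, y1), (x2, y2) = coords[w1], coords[w2]
    return math.hypot(x2 - x1, y2 - y1)

for w in ["atmosphere", "ceiling", "basement"]:
    print(f"sky -> {w}: {dist('sky', w):.0f}")
```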
This is the core insight: semantic relationships become geometric relationships.
Semantic Regions
When you plot thousands of word vectors in 2D (using dimensionality reduction techniques like t-SNE or UMAP), patterns emerge. Words cluster into regions.
Some regions have obvious interpretations:
Region           Example Words
--------------------------------------------------------
aerial_domain    sky, atmosphere, clouds, stratosphere
water_domain     ocean, sea, lake, river
enclosed_space   room, box, chamber, cell
underground      basement, cave, tunnel, abyss
human_concepts   king, queen, man, woman, father, mother
time_concepts    moment, eternity, instant, epoch
movement         running, flying, swimming, falling
Important caveat: These labels are human interpretations of emergent clusters. The model stores only coordinates. It has no concept of "aerial_domain" as a category. We impose that interpretation when we see words grouping together.
This is both the power and the limitation of vector spaces: they capture statistical structure that correlates with meaning, but they don't encode meaning explicitly.
Real semantic space isn't cleanly partitioned. Words belong to multiple regions simultaneously:
Word      Regions
----------------------------------------------------
sky       aerial_domain, openness, vertical_high, natural
cave      underground, natural, bounded
river     natural, water, movement
ceiling   bounded, vertical_high
"Sky" participates in the aerial cluster but also shares properties with "open" concepts and "natural" phenomena. "Cave" is underground but also natural and enclosed.
This multi-membership is a feature, not a bug. Real concepts have multiple facets. A vector space can encode that complexity through position.
Vector Arithmetic
In 2013, Mikolov et al. discovered something remarkable: word vectors support meaningful arithmetic.
vector("king") - vector("man") + vector("woman") โ vector("queen")
This works because the relationship "king:queen" encodes the same gender transformation as "man:woman." In vector terms, the offset from "man" to "woman" is similar to the offset from "king" to "queen."
king ------------------- queen
 |                         |
 |    (add feminine,       |
 |   subtract masculine)   |
 |                         |
man ------------------- woman
The vector (king - man) captures something like "royalty." Adding that to "woman" lands you near "queen."
The arithmetic isn't magic. It's geometry.
- Compute the target point: A - B + C = target
- Find the nearest existing word vector to that target
- That word is the "answer"
def analogy(A, B, C, vectors):
    # A is to B as C is to ?
    target = vectors[A] - vectors[B] + vectors[C]
    # Find the nearest word vector (excluding the inputs)
    best_word = None
    best_distance = float("inf")
    for word, vec in vectors.items():
        if word in (A, B, C):
            continue
        dist = distance(target, vec)
        if dist < best_distance:
            best_distance = dist
            best_word = word
    return best_word

# Example, with trained vectors loaded into the `vectors` dict
analogy("king", "man", "woman", vectors)  # ≈ "queen"
Vector arithmetic works when the relationship is linear and consistent across the vocabulary.
How Words Learn Their Positions
The vectors don't come from nowhere. They emerge from training on massive text corpora.
Modern embedding methods (Word2Vec, GloVe, and the embedding layers in transformers) all share a core idea: predict words from context, and adjust vectors to minimize prediction error.
Consider training on this sentence: "looked up at the sky"
The model sees: ["looked", "up", "at", "the", "___"]
Its job: predict that the blank is "sky."
Initially, the prediction is random. But after seeing millions of sentences, the model adjusts:
- The "sky" vector moves closer to words that predict it (up, looked, blue, clear)
- It moves away from words that don't (basement, floor, underground)
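That training loop can be sketched end-to-end on a toy corpus. The version below is a bare-bones skip-gram with a full softmax (real Word2Vec uses tricks like negative sampling for speed; the corpus, window size, and dimension here are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(42)
corpus = [["looked", "up", "at", "the", "sky"],
          ["birds", "flew", "across", "the", "sky"],
          ["inside", "the", "dark", "cave"],
          ["explored", "the", "dark", "cave"]]
vocab = sorted({w for s in corpus for w in s})
idx = {w: i for i, w in enumerate(vocab)}
V, d = len(vocab), 8

W_in = rng.normal(scale=0.1, size=(V, d))    # the word vectors being learned
W_out = rng.normal(scale=0.1, size=(V, d))   # "context" vectors used for prediction

# (center, context) pairs within a window of 2
pairs = [(idx[s[i]], idx[s[j]])
         for s in corpus for i in range(len(s))
         for j in range(max(0, i - 2), min(len(s), i + 3)) if i != j]

def train_epoch(W_in, W_out, lr=0.1):
    """One SGD pass over all pairs; returns the mean prediction loss."""
    total = 0.0
    for c, o in pairs:
        logits = W_out @ W_in[c]                  # score every word as context
        p = np.exp(logits - logits.max())
        p /= p.sum()
        total += -np.log(p[o])                    # cross-entropy: -log P(context)
        grad = p.copy()
        grad[o] -= 1.0                            # softmax gradient w.r.t. logits
        d_in = grad @ W_out                       # gradient for the center vector
        W_out -= lr * np.outer(grad, W_in[c])     # nudge context vectors (in place)
        W_in[c] -= lr * d_in                      # nudge the center vector
    return total / len(pairs)

losses = [train_epoch(W_in, W_out) for _ in range(50)]
```

Over repeated epochs the mean loss falls: the "adjust vectors to reduce prediction error" process described above, in miniature. With a real corpus, the same mechanics push "sky" toward its typical neighbors.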
Different words develop different contexts:
"sky" "cave" "king" ยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยท looked up at inside the the king ruled birds across dark cave bow to the king the sky was explored the king and queen endless sky cave paintings the king decreed
"Sky" appears in outdoor, upward, light contexts. "Cave" appears in underground, enclosed, dark contexts. "King" appears in power, royalty, and governance contexts.
After training, their vectors reflect these associations. "Sky" and "cave" end up far apart (opposite contexts). "King" ends up near "queen," "prince," and "ruler" (shared governance contexts).
Here's what's profound: the model never receives semantic labels.
Nobody tells it that "sky" means "the atmosphere." Nobody defines categories. The model just predicts text and adjusts numbers.
Yet it discovers structure that aligns with human semantic categories. Why?
Because language itself encodes meaning through usage patterns. The distributional hypothesis isn't just a modeling trick. It's an observation about how meaning works. Words that mean similar things get used similarly. Training on usage patterns recovers meaning.
Interactive Vector Space
The demo below visualizes these concepts in a 2D projection:
- Explore Space: Click words to see their semantic regions and nearest neighbors
- Similarity & Distance: Watch connections form between related words
- Vector Arithmetic: Build your own "A - B + C = ?" analogies
- How It Works: See hypothetical training contexts for each word
Note: This is a 2D projection of what would normally be ~768 dimensions. Region labels are human interpretations of emergent clusters. The vectors are illustrative, not from a production model.
From Word Vectors to Language Models
Word2Vec and GloVe produce static embeddings: one vector per word, regardless of context. "Bank" gets the same vector whether it means a financial institution or a river bank.
Modern language models (GPT, BERT, Claude) use contextual embeddings: the vector for a word changes based on surrounding text.
Static embedding (always the same):
"bank" โ [0.234, -0.891, ...]
Contextual embedding:
"I went to the bank to deposit money" โ bank = [0.234, -0.891, ...] "I sat on the river bank" โ bank = [-0.445, 0.567, ...]
The transformer architecture enables this by computing attention across the full input sequence. Each token's representation incorporates information from all other tokens.
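A single self-attention step is enough to see the mechanism. The sketch below uses random, untrained weights (so the output numbers are meaningless), but it shows how the same static "bank" embedding comes out different depending on the sentence around it:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding width (arbitrary for illustration)
words = ["i", "went", "to", "the", "bank", "deposit", "money", "sat", "on", "river"]
E = {w: rng.normal(size=d) for w in words}                # static per-word embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))  # untrained projections

def contextual(tokens):
    """One self-attention layer: each token's output mixes all tokens' values."""
    X = np.stack([E[t] for t in tokens])            # (n, d) static embeddings
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)                   # pairwise attention logits
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)              # softmax over each row
    return w @ V                                    # context-dependent outputs

a = contextual(["i", "went", "to", "the", "bank", "to", "deposit", "money"])
b = contextual(["i", "sat", "on", "the", "river", "bank"])
bank_financial, bank_river = a[4], b[5]             # same word, two representations
```

The two "bank" rows differ because attention mixed in different neighbors; in a trained transformer, that difference is what separates the two senses.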
But the core insight remains: meaning is geometric. Whether static or contextual, embeddings encode semantic relationships as spatial relationships.
The Philosophical View
Vector spaces reduce meaning to geometry. Is that... meaning?
The model doesn't "understand" that the sky is above us or that kings rule kingdoms. It knows that "sky" and "above" have high cosine similarity. It knows that "king" vectors relate to "rule" vectors through consistent offsets.
This is a functional definition of meaning: meaning is the pattern of relationships a concept has with other concepts. If two things have the same relationships, they mean the same thing.
Critics argue this misses something essential. Embodiment. Experience. Grounding. A model that learns "sky" from text has never looked up.
But perhaps meaning was never what we thought it was. Perhaps relationships were always the substance, and our intuition of "deeper understanding" was just the feeling of having many relationships.
Either way, the vectors work. They power search engines, translation systems, chatbots, and a thousand other applications. The geometry of meaning, whatever its philosophical status, is remarkably useful.
References
- Mikolov, T., et al. (2013). "Efficient Estimation of Word Representations in Vector Space." arXiv.
- Mikolov, T., et al. (2013). "Distributed Representations of Words and Phrases and their Compositionality." NeurIPS.
- Pennington, J., Socher, R., & Manning, C. (2014). "GloVe: Global Vectors for Word Representation." EMNLP.
- Peters, M., et al. (2018). "Deep contextualized word representations." NAACL (ELMo paper).
- Devlin, J., et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers." NAACL.
- Bolukbasi, T., et al. (2016). "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings." NeurIPS.
- Firth, J.R. (1957). "A Synopsis of Linguistic Theory, 1930-1955." Studies in Linguistic Analysis.