
The Hidden Geography of Language

How words turn into coordinates, and why "king minus man plus woman" equals "queen".

In 1957, linguist J.R. Firth wrote: "You shall know a word by the company it keeps."

Big Bird in a police lineup with men in suits
Today's embeddings are brought to you by the letters Q, K, and V.

Sixty years later, that sentence became the theoretical foundation for how every major language model understands text.

GPT-4, Claude, LLaMA, Mistral: they all represent words as points in high-dimensional space. Similar words cluster together. Related concepts form neighborhoods. And the distance between two points encodes how semantically related they are.

This is the story of how meaning became geometry.

The Problem with Words

Computers don't understand words. They understand numbers.

When you type "sky" into a language model, the machine doesn't have some internal notion of "the atmosphere visible from Earth." It has a sequence of bytes: 0x73, 0x6B, 0x79. That's it.

Early NLP systems tried to bridge this gap with hand-crafted rules. "Sky" belongs to the category NATURE. It relates to WEATHER. It's a noun. But these symbolic approaches couldn't scale. Human annotators couldn't possibly enumerate every relationship between every word.

The breakthrough came from a different direction entirely: let the computer learn what words mean by watching how they're used.

Firth's insight was simple but profound: words that appear in similar contexts have similar meanings.

Consider these sentences:

  "She looked up at the sky."
  "Birds drifted across the sky."
  "The sky was endless."

Now consider these:

  "She looked up at the ceiling."
  "A lamp hung from the ceiling."
  "The ceiling was painted white."

"Sky" and "ceiling" share some contexts ("looked up at the ___") but diverge in others. They're related but not identical. A system that tracks these patterns could learn that "sky" and "ceiling" are both "things above you" but differ in scale and setting.
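A system that tracks these patterns can be sketched in a few lines of Python. The corpus and the two-word context window below are illustrative inventions, not data from any real system:

```python
# Toy sketch of shared vs. divergent contexts: each word's "context"
# is the two words that precede it. Corpus is invented for illustration.
from collections import defaultdict

corpus = [
    "she looked up at the sky",
    "birds drifted across the sky",
    "she looked up at the ceiling",
    "a lamp hung from the ceiling",
]

contexts = defaultdict(set)
for sentence in corpus:
    words = sentence.split()
    for i, word in enumerate(words):
        # Record the two preceding words as this word's context.
        contexts[word].add(tuple(words[max(0, i - 2):i]))

shared = contexts["sky"] & contexts["ceiling"]
sky_only = contexts["sky"] - contexts["ceiling"]

print(shared)    # {('at', 'the')} -- contexts both words appear in
print(sky_only)  # {('across', 'the')} -- contexts unique to "sky"
```

The overlap ("at the ___") signals relatedness; the divergence signals that the words aren't interchangeable.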

This is precisely what word embeddings are designed to do.

Words as Vectors

A word embedding is a list of numbers that represents the meaning of a word. Each number corresponds to a dimension, and the collection of numbers forms a point in space.

"sky" → [0.234, -0.891, 0.127, 0.445, ..., -0.332]  (768 dimensions)
"atmosphere" → [0.198, -0.845, 0.089, 0.512, ..., -0.298]
"basement" → [-0.445, 0.123, -0.667, -0.234, ..., 0.891]

These numbers aren't arbitrary. They emerge from training: the model adjusts vectors to minimize prediction error on massive text corpora. Words that appear in similar contexts develop similar vectors.

Why 768 dimensions? Why not 3, or 50, or 10,000?

Too few dimensions: The space gets crowded. Unrelated words collide. "Bank" (financial) and "bank" (river) need room to separate based on context, but there's no space for nuance.

Too many dimensions: Diminishing returns. Each new dimension adds parameters to learn, but the marginal semantic information decreases. Training becomes slower without proportional benefit.

The sweet spot depends on vocabulary size and the amount of training data.

Modern large language models use:

Model          Embedding Dimensions
···································
BERT-base      768
GPT-2          768
GPT-3          12,288
LLaMA 2 (7B)   4,096
LLaMA 2 (70B)  8,192

Larger models can afford more dimensions because they have more parameters to learn the relationships.

Here's what surprises most people: individual dimensions don't have human-interpretable meanings.

There's no "noun-ness" dimension. No "positive sentiment" axis. The model discovers whatever structure helps it predict text. Some of that structure aligns with human categories. Much of it doesn't.

Researchers have probed embeddings and found that specific directions encode properties like:

  gender (the "he" → "she" offset)
  verb tense (the "walk" → "walked" offset)
  plurality (the "cat" → "cats" offset)
  comparative degree (the "good" → "better" offset)

But these are directions, not individual dimensions. The encoding is distributed across the whole vector.

Distance as Meaning

Once words are vectors, we can measure the distance between them. And that distance encodes semantic similarity.

The straight-line distance between two points:

distance(sky, atmosphere) = sqrt(
    (0.234 - 0.198)² +
    (-0.891 - (-0.845))² +
    (0.127 - 0.089)² +
    ...
)

Small distance = similar meaning. Large distance = different meaning.

Boy looking up at the moon
Same position in space. Different position in meaning.

In practice, cosine similarity is more common than Euclidean distance. It measures the angle between two vectors, ignoring their magnitudes:

cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)

Two vectors pointing in the same direction have similarity = 1.
Perpendicular vectors have similarity = 0.
Opposite vectors have similarity = -1.

Why cosine over Euclidean? Word vectors can have different magnitudes based on word frequency and training dynamics. Cosine ignores this, focusing purely on the direction of meaning.
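The formula above fits in a few lines of Python, using plain lists as vectors:

```python
# Minimal cosine similarity over plain Python lists.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Same direction, different magnitudes: similarity is still 1.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
# Perpendicular vectors: similarity 0.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
# Opposite vectors: similarity -1.
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))
```

The first pair shows the magnitude-invariance directly: [1, 2] and [2, 4] point the same way, so doubling the length changes nothing.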

Let's compute distances between words (using a hypothetical 2D projection for illustration):

Word           X       Y
·····························
sky            290     200
atmosphere     330     170
ceiling        580     270
basement       620     530

Euclidean distances from "sky":

sky → atmosphere:  √[(330-290)² + (170-200)²] = 50
sky → ceiling:     √[(580-290)² + (270-200)²] ≈ 298
sky → basement:    √[(620-290)² + (530-200)²] ≈ 467

The numbers match intuition: "sky" is closest to "atmosphere" (both refer to the open air above), moderately distant from "ceiling" (a surface above you, but indoors), and far from "basement" (enclosed, underground).
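The same arithmetic in code, using the coordinates from the table above and Python's built-in math.dist (Python 3.8+):

```python
# Recompute the 2D Euclidean distances from the table: math.dist
# is exactly the square-root-of-summed-squared-differences formula.
import math

points = {
    "sky": (290, 200),
    "atmosphere": (330, 170),
    "ceiling": (580, 270),
    "basement": (620, 530),
}

for word in ("atmosphere", "ceiling", "basement"):
    d = math.dist(points["sky"], points[word])
    print(f"sky -> {word}: {d:.0f}")  # rounded to the nearest integer
```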

This is the core insight: semantic relationships become geometric relationships.

Semantic Regions

When you plot thousands of word vectors in 2D (using dimensionality reduction techniques like t-SNE or UMAP), patterns emerge. Words cluster into regions.

Some regions have obvious interpretations:

Region              Example Words
···············································
aerial_domain       sky, atmosphere, clouds, stratosphere
water_domain        ocean, sea, lake, river
enclosed_space      room, box, chamber, cell
underground         basement, cave, tunnel, abyss
human_concepts      king, queen, man, woman, father, mother
time_concepts       moment, eternity, instant, epoch
movement            running, flying, swimming, falling

Important caveat: These labels are human interpretations of emergent clusters. The model stores only coordinates. It has no concept of "aerial_domain" as a category. We impose that interpretation when we see words grouping together.

This is both the power and the limitation of vector spaces: they capture statistical structure that correlates with meaning, but they don't encode meaning explicitly.

Real semantic space isn't cleanly partitioned. Words belong to multiple regions simultaneously:

Word           Regions
···············································
sky            aerial_domain, openness, vertical_high, natural
cave           underground, natural, bounded
river          natural, water, movement
ceiling        bounded, vertical_high

"Sky" participates in the aerial cluster but also shares properties with "open" concepts and "natural" phenomena. "Cave" is underground but also natural and enclosed.

This multi-membership is a feature, not a bug. Real concepts have multiple facets. A vector space can encode that complexity through position.

Vector Arithmetic

In 2013, Mikolov et al. discovered something remarkable: word vectors support meaningful arithmetic.

vector("king") - vector("man") + vector("woman") ≈ vector("queen")

This works because the relationship "king:queen" encodes the same gender transformation as "man:woman." In vector terms, the offset from "man" to "woman" is similar to the offset from "king" to "queen."

king ────────────────→ queen
  │                      │
  │  (add feminine,      │
  │   subtract masculine)│
  │                      │
  ↓                      ↓
man ─────────────────→ woman

The vector (king - man) captures something like "royalty." Adding that to "woman" lands you near "queen."

The arithmetic isn't magic. It's geometry.

  1. Compute the target point: A - B + C = target
  2. Find the nearest existing word vector to that target
  3. That word is the "answer"

def distance(u, v):
    # Euclidean distance between two equal-length vectors
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v)) ** 0.5

def analogy(a, b, c, vectors):
    # a is to b as c is to ?  (vectors maps each word to a list of floats)
    target = [va - vb + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]

    # Find the nearest word to the target, excluding the inputs
    best_word = None
    best_distance = float("inf")

    for word, vector in vectors.items():
        if word in (a, b, c):
            continue
        dist = distance(target, vector)
        if dist < best_distance:
            best_distance = dist
            best_word = word

    return best_word

# Example
analogy("king", "man", "woman", vectors)  # Returns "queen"

Vector arithmetic works when the relationship is linear and consistent across the vocabulary.

How Words Learn Their Positions

The vectors don't come from nowhere. They emerge from training on massive text corpora.

Modern embedding methods (Word2Vec, GloVe, and the embedding layers in transformers) all share a core idea: predict words from context, and adjust vectors to minimize prediction error.

Consider training on this sentence: "looked up at the sky"

The model sees: ["looked", "up", "at", "the", "___"]

Its job: predict that the blank is "sky."

Initially, the prediction is random. But after seeing millions of sentences, the model adjusts: the vector for "sky" shifts so that contexts like "looked up at the" predict it strongly, and words that keep appearing in the same contexts end up with similar vectors.

Different words develop different contexts:

"sky"          "cave"          "king"
···············································
looked up at   inside the      the king ruled
birds across   dark cave       bow to the king
the sky was    explored the    king and queen
endless sky    cave paintings  the king decreed

"Sky" appears in outdoor, upward, light contexts. "Cave" appears in underground, enclosed, dark contexts. "King" appears in power, royalty, and governance contexts.

After training, their vectors reflect these associations. "Sky" and "cave" end up far apart (opposite contexts). "King" ends up near "queen," "prince," and "ruler" (shared governance contexts).
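That process can be caricatured with simple co-occurrence counts. The corpus, stopword list, and ±2 word window below are invented for illustration; real models learn dense vectors by prediction rather than counting, but the effect (shared contexts pull words together) is the same:

```python
# Count-based stand-in for learned embeddings: build co-occurrence
# vectors over a +/-2 word window (skipping a couple of stopwords),
# then compare them with cosine similarity. Corpus is invented.
import math
from collections import Counter

corpus = [
    "the king ruled wisely",
    "the queen ruled wisely",
    "the king wore a crown",
    "the queen wore a crown",
    "the dark cave was cold",
    "the deep cave was dark",
]

STOPWORDS = {"the", "a"}
WINDOW = 2

vocab = sorted({w for line in corpus for w in line.split()} - STOPWORDS)
cooc = {w: Counter() for w in vocab}
for line in corpus:
    words = line.split()
    for i, w in enumerate(words):
        if w in STOPWORDS:
            continue
        lo, hi = max(0, i - WINDOW), min(len(words), i + WINDOW + 1)
        for j in range(lo, hi):
            if j != i and words[j] not in STOPWORDS:
                cooc[w][words[j]] += 1

def vec(word):
    # The word's co-occurrence counts laid out over the whole vocabulary.
    return [cooc[word][context] for context in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine(vec("king"), vec("queen")))  # high: near-identical contexts
print(cosine(vec("king"), vec("cave")))   # 0.0: no shared contexts
```

Nobody told the program that kings and queens are related; the similarity falls out of the counts.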

Here's what's profound: the model never receives semantic labels.

Nobody tells it that "sky" means "the atmosphere." Nobody defines categories. The model just predicts text and adjusts numbers.

Yet it discovers structure that aligns with human semantic categories. Why?

Because language itself encodes meaning through usage patterns. The distributional hypothesis isn't just a modeling trick. It's an observation about how meaning works. Words that mean similar things get used similarly. Training on usage patterns recovers meaning.

Interactive Vector Space

The demo below visualizes these concepts in a 2D projection:

Interactive Vector Space Demo

Note: This is a 2D projection of what would normally be ~768 dimensions. Region labels are human interpretations of emergent clusters. The vectors are illustrative, not from a production model.

From Word Vectors to Language Models

Word2Vec and GloVe produce static embeddings: one vector per word, regardless of context. "Bank" gets the same vector whether it means a financial institution or a river bank.

Modern language models (GPT, BERT, Claude) use contextual embeddings: the vector for a word changes based on surrounding text.

Static embedding (always the same):

"bank" → [0.234, -0.891, ...]

Contextual embedding:

"I went to the bank to deposit money" → bank =
[0.234, -0.891, ...]

"I sat on the river bank" → bank =
[-0.445, 0.567, ...]

The transformer architecture enables this by computing attention across the full input sequence. Each token's representation incorporates information from all other tokens.
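A toy version of that mixing, with invented 2-dimensional static vectors and a single attention step (no learned query/key/value projections, just raw dot products, so this is a caricature of real transformer attention):

```python
# Toy contextual embeddings: one round of dot-product attention over
# static token vectors. All vectors are made up; the point is only
# that "bank" gets a different representation in each sentence.
import math

static = {
    "bank":    [0.5, 0.5],
    "money":   [1.0, 0.0],
    "deposit": [0.9, 0.1],
    "river":   [0.0, 1.0],
    "grassy":  [0.1, 0.9],
}

def attention(tokens):
    """Contextual vector per token: a softmax-weighted average of all
    token vectors, weighted by dot-product similarity to that token."""
    vecs = [static[t] for t in tokens]
    out = []
    for q in vecs:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in vecs]
        exps = [math.exp(s) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        mixed = [sum(w * v[d] for w, v in zip(weights, vecs))
                 for d in range(len(q))]
        out.append(mixed)
    return out

bank_financial = attention(["deposit", "money", "bank"])[2]
bank_river = attention(["grassy", "river", "bank"])[2]
print(bank_financial, bank_river)  # same word, two different vectors
```

The financial "bank" is pulled toward the money-like direction, the river "bank" toward the nature-like direction, even though both started from the same static vector.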

But the core insight remains: meaning is geometric. Whether static or contextual, embeddings encode semantic relationships as spatial relationships.

The Philosophical View

Vector spaces reduce meaning to geometry. Is that... meaning?

The model doesn't "understand" that the sky is above us or that kings rule kingdoms. It knows that "sky" and "above" have high cosine similarity. It knows that "king" vectors relate to "rule" vectors through consistent offsets.

This is a functional definition of meaning: meaning is the pattern of relationships a concept has with other concepts. If two things have the same relationships, they mean the same thing.

Critics argue this misses something essential. Embodiment. Experience. Grounding. A model that learns "sky" from text has never looked up.

But perhaps meaning was never what we thought it was. Perhaps relationships were always the substance, and our intuition of "deeper understanding" was just the feeling of having many relationships.

Either way, the vectors work. They power search engines, translation systems, chatbots, and a thousand other applications. The geometry of meaning, whatever its philosophical status, is remarkably useful.


References

  1. Mikolov, T., et al. (2013). "Efficient Estimation of Word Representations in Vector Space." arXiv.
  2. Mikolov, T., et al. (2013). "Distributed Representations of Words and Phrases and their Compositionality." NeurIPS.
  3. Pennington, J., Socher, R., & Manning, C. (2014). "GloVe: Global Vectors for Word Representation." EMNLP.
  4. Peters, M., et al. (2018). "Deep contextualized word representations." NAACL (ELMo paper).
  5. Devlin, J., et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers." NAACL.
  6. Bolukbasi, T., et al. (2016). "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings." NeurIPS.
  7. Firth, J.R. (1957). "A Synopsis of Linguistic Theory, 1930-1955." Studies in Linguistic Analysis.