Sources
Grounding, citations, and further reading for How Vector Databases Actually Work.
All of this is optional. These are the sources behind the article. Nothing on this page is required reading, and you do not need to purchase any of these books.
The article itself is self-contained. This page exists so that the work is properly cited and so that anyone who wants to go deeper knows where to look.
References
1Malkov, Y
Malkov, Y. A. & Yashunin, D. A. (2018). "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs." IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(4), 824-836.
2Jegou, H
Jegou, H., Douze, M. & Schmid, C. (2011). "Product Quantization for Nearest Neighbor Search." IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1), 117-128.
3Johnson, J
Johnson, J., Douze, M. & Jegou, H. (2019). "Billion-scale similarity search with GPUs." IEEE Transactions on Big Data, 7(3), 535-547.
4Milgram, S
Milgram, S. (1967). "The Small World Problem." Psychology Today, 2(1), 60-67.
5Watts, D
Watts, D. J. & Strogatz, S. H. (1998). "Collective dynamics of 'small-world' networks." Nature, 393(6684), 440-443.
6Chroma
Chroma. (2024). "Chroma Documentation." trychroma.com.
7FAISS
FAISS. (2024). "FAISS: A library for efficient similarity search." Facebook Research, GitHub.
Introduction
8Grounding note
The textbook explains that embeddings represent tokens as arrays of floating-point numbers capturing semantic relationships. Vector databases exist because brute-force comparison of these high-dimensional vectors doesn't scale. Every indexing strategy in this article is an answer to that scaling problem. See GH #3, Ch. 3.
The Brute-Force Baseline
9Grounding note
SLP3 §5.4 defines the cosine similarity metric (Eq. 5.10) that underpins the "similarity score" referenced here. The cosine is the normalized dot product: cosine(v,w) = (v . w) / (|v| |w|). Jurafsky and Martin explain that normalization by vector lengths is essential because raw dot products favor frequent words with longer vectors. For unit-normalized vectors (which most modern embedding models produce), the dot product is the cosine, which is why FAISS's IndexFlatIP (inner product) and cosine search are equivalent on L2-normalized embeddings.
10Grounding note
Widdows and Cohen provide useful context here. In Ch. 2, they walk through the cosine similarity computation step by step using a term-document matrix, showing how query vectors are compared against document vectors. They note that in higher dimensions, randomly chosen pairs of vectors tend to have cosine similarity closer to zero, which is why high cosine scores become meaningful signals. This is exactly the distance computation that a brute-force scan repeats for every stored vector. Widdows & Cohen, Issue #45
11Grounding note
SLP3 §11.1.4 describes how sparse retrieval solved this same scaling problem decades ago using the inverted index: a data structure that maps each term to a postings list of documents containing that term. Given a query, you only score documents that share at least one term with the query, skipping the vast majority. The algorithms in this article (HNSW, IVF, PQ) are the dense-vector equivalent of that insight, achieving sub-linear search through spatial partitioning rather than term overlap. The inverted index is exact (no recall loss); these approximate methods trade some recall for speed.
12Grounding note
Widdows and Cohen discuss precision and recall as the foundational evaluation metrics for information retrieval, tracing them back to the Cranfield experiments of the 1960s (Ch. 2, Section 2.3.3). They define recall as the proportion of relevant documents actually retrieved, and precision as the proportion of retrieved documents that are relevant. The "speed versus recall" tradeoff in this article is a direct descendant of that IR tradition, applied to vector indices rather than keyword indices. Widdows & Cohen, Issue #45
13Grounding note
Raschka shows that embedding dimensions vary dramatically by model: GPT-2 small uses 768 dimensions, GPT-2 large uses 1,600, and GPT-3 175B uses 12,288. Every algorithm in this article must handle these high-dimensional spaces efficiently, and the choice of embedding model directly determines the index structure's performance characteristics. See GH #4, Ch. 2.
HNSW: The Dominant Algorithm
14Grounding note
SLP3 §11.3 frames dense retrieval as a nearest neighbor search problem and notes that "modern systems therefore make use of approximate nearest neighbor vector search algorithms like Faiss" (Johnson et al., 2017). Jurafsky and Martin position HNSW within the broader context of the bi-encoder architecture (§11.3, Eq. 11.18): once queries and documents are encoded as separate vectors using BERT_Q and BERT_D, finding the most relevant documents reduces to finding the vectors with the highest dot product, which is exactly the nearest neighbor problem HNSW solves.
15Grounding note
Widdows and Cohen discuss a related concept: the high-dimensional property that randomly chosen vectors tend to be nearly orthogonal (Ch. 2). In such spaces, vectors that are close together carry a strong signal. This is what makes graph-based navigability work: the "small world" edges connect genuinely meaningful neighbors, and the cosine similarity between random pairs is effectively noise near zero. The HNSW greedy walk depends on this geometric property of high-dimensional embedding spaces. Widdows & Cohen, Issue #45
IVF: Partitioning the Search Space
16Grounding note
Widdows and Cohen describe a related dimensionality reduction technique in Ch. 2 (Section 2.4): Latent Semantic Analysis (LSA) applies principal component analysis to term-document matrices to find the most significant axes and project vectors into a lower-dimensional space. IVF's k-means clustering operates on a similar intuition: rather than reducing dimensions, it partitions the full space into regions defined by centroids, achieving a different kind of data reduction that enables sub-linear search. Widdows & Cohen, Issue #45
17Grounding note
SLP3 §11.1.1 describes the vector space model (Salton, 1971) where documents are represented as vectors and similarity is measured by cosine. In that framework, IVF's k-means clustering can be understood as learning a coarse partitioning of the document vector space. Each centroid acts like a "prototype document," and the query-centroid comparison is a fast first-pass retrieval step, analogous to how sparse retrieval narrows candidates via the inverted index (§11.1.4) before scoring. The key difference is that IVF partitions geometric regions rather than vocabulary terms.
Product Quantization: Compressing Vectors
18Grounding note
Widdows and Cohen discuss quantization in a different but related context in Ch. 5 (Section 5.3.5). They explain how LLM weights are quantized from 32-bit floats down to FP16, FP8, or even single-bit precision to reduce memory. The principle is the same as Product Quantization for vectors: trade numerical precision for dramatic memory savings. They note that weight distributions are clustered near zero with a few outliers, and that a well-chosen reduced-precision format can approximate them effectively. PQ applies an analogous insight to embedding vectors rather than model weights. Widdows & Cohen, Issue #45
19Grounding note
SLP3 §5.3 provides useful context for understanding why this lossy compression is tolerable. Jurafsky and Martin show that sparse co-occurrence vectors are "mostly zeros (since most words simply never occur in the context of others)." Dense embeddings are already a compressed representation of the full |V|-dimensional co-occurrence space. PQ applies a second layer of compression on top of that, from float32 precision per dimension down to a codebook index. The information loss at each stage is mitigated by redundancy: most of the variance in embedding space is captured by a relatively small number of cluster centroids, just as most semantic information is captured by the first few hundred dimensions.
Choosing the Right Index
20Grounding note
SLP3 §11.3 confirms that practical dense retrieval systems rely on FAISS (Johnson et al., 2017) for efficient nearest neighbor search. The textbook frames the choice between sparse and dense retrieval as a fundamental architectural decision: sparse retrieval uses inverted indices on tf-idf or BM25 vectors, while dense retrieval uses approximate nearest neighbor algorithms on learned embeddings. This article's guidance on index selection can be read as the engineering-level detail beneath the textbook's high-level framing of "dense retrieval" as a single category.
21Grounding note
Alammar & Grootendorst demonstrate dense retrieval as nearest-neighbor search on embeddings, using FAISS to build and query indices. They frame the choice between semantic search (embedding-based) and keyword search (BM25) as a core tradeoff: semantic search catches paraphrases but misses exact terms, while keyword search is the reverse. Hybrid approaches combining both often perform best. See GH #5, Ch. 8.
The Database Landscape
22Grounding note
SLP3 §11.1.2 provides the formal derivation of the BM25 weighting scheme that Weaviate's hybrid search relies on. BM25 (Eq. 11.12) extends tf-idf by adding parameters k (controlling the balance between term frequency and IDF) and b (controlling document length normalization). Jurafsky and Martin note that reasonable defaults are k = [1.2, 2] and b = 0.75. The textbook also explains the vocabulary mismatch problem (§11.3): sparse retrieval "work[s] only if there is exact overlap of words between the query and document." This is precisely the gap that hybrid search bridges, combining BM25's exact matching strength with dense retrieval's synonym handling.
23Grounding note
Widdows and Cohen trace the BM25 weighting scheme back to tf-idf extensions in Ch. 2 (Section 2.3), noting that BM25 adjusts for document length and became a cornerstone of keyword-based retrieval. The hybrid approach Weaviate uses, blending BM25 with vector similarity, recombines these two lineages: the statistical keyword tradition that Widdows and Cohen describe from the 1960s onward, and the embedding-based semantic search that descends from their Ch. 2-3 vector space models. Widdows & Cohen, Issue #45
Metadata Filtering and Why It Matters for RAG
24Grounding note
SLP3 §11.1 defines a collection as the set of documents being searched, noting it "can mean the entire web" or "a smaller corporate repo, or even a set of documents used by one person." The metadata filtering described in this article is essentially the mechanism for dynamically defining the collection at query time. In the textbook's framework, pre-filtering restricts the collection before retrieval begins; post-filtering applies relevance judgments after. The textbook's treatment of document-level attributes (Fig. 11.1 shows document processing feeding into the index) implicitly assumes a fixed collection, making the dynamic filtering described here a practical extension of the formal model.
Putting It Together for RAG
25Grounding note
Widdows and Cohen describe RAG in Ch. 5 (Section 5.3.3), characterizing it as a computational compromise: "it's expensive to train a whole new language model for every domain, but relatively cheap to build a search engine, so we put these together into a hybrid system that does better than either component on its own." They also warn that RAG is easily misinterpreted, since retrieval-augmented answers are not produced directly from a database of established facts, even when marketed as referencing an "authoritative knowledge base." This is an important caveat for anyone building the pipeline this article describes. Widdows & Cohen, Issue #45
26Grounding note
SLP3 §11.1 (Fig. 11.1) diagrams the full ad hoc IR architecture: a document collection is processed and indexed, a query is processed into a query vector, and a search module computes relevance scores to produce ranked documents. The RAG pipeline described here maps directly onto this architecture, with the vector database serving as both the "Document Index" and the "Search" module in Jurafsky and Martin's diagram. The key extension is that RAG adds a generation step after retrieval, using an LLM to synthesize retrieved passages into an answer. The textbook notes that RAG addresses three LLM limitations: hallucination, inability to access proprietary data, and static knowledge cutoffs.
27Grounding note
SLP3 §5.1 provides a foundational reason why vector databases are "necessary but not sufficient." Jurafsky and Martin explain that lemmas are polysemous (have multiple word senses); the word mouse can refer to a rodent or a computing device. Static embedding models assign a single vector to each word, blending all senses together. This means the vector database faithfully indexes whatever the embedding model produces, polysemy and all. When a query about a computer mouse retrieves documents about rodents, the fault lies in the embedding, not the index. Contextual embeddings (§5.5 notes the shift to BERT in Chapter 9) partially address this, but the fundamental point stands: the index can only be as good as its input vectors.
28Grounding note
Widdows and Cohen illustrate this "necessary but not sufficient" point vividly in Ch. 3 (Section 3.4). They show that a single global embedding vector for Romeo conflates Shakespeare and Alfa Romeo sportscar meanings. In a RAG system backed by a vector database, a query about "Alfa Romeo" could retrieve documents about Shakespeare's plays as nearest neighbors. The vector database returns whatever is geometrically close, and if the embeddings are ambiguous, the results will be too. This underscores the article's point that embedding model selection is as critical as the index algorithm. Widdows & Cohen, Issue #45
29Grounding note
Alammar & Grootendorst make the point that embedding quality directly determines retrieval quality: a vector database is only as good as the vectors it stores. They show that switching from a generic embedding model to a domain-tuned one via contrastive fine-tuning can dramatically change which documents surface as nearest neighbors, even without changing the index algorithm. See GH #5, Ch. 10.