How Vector Databases Actually Work
Every RAG tutorial shows three lines of code to store and retrieve vectors. But the data structures underneath those three lines determine whether your system handles ten thousand documents or collapses at one million. Here is what is actually happening.
The Brute-Force Baseline
Before we discuss any clever indexing, it is worth understanding the naive approach. You have a collection of vectors, each representing a document chunk, an image, or a sentence. A user submits a query, which gets embedded into the same vector space. You need to find the k most similar vectors in your collection.
The simplest strategy is to compare the query vector against every single stored vector, compute a similarity score for each, sort the results, and return the top k. This is called a flat scan or brute-force search. It produces perfect recall because you literally check everything.
For a collection of 10,000 vectors with 1,536 dimensions (the output size of OpenAI's text-embedding-3-small), a flat scan works fine. On modern hardware, you can compute around 10,000 cosine similarities in a few milliseconds. But the cost is O(n) per query, where n is the number of vectors. That linear scaling is the problem.
At one million vectors, each query now requires one million distance computations. At 100 million, the math becomes punishing. If you are serving an application with hundreds of concurrent users, flat scan is not viable. You need some way to avoid comparing against most of the collection while still returning accurate results.
This is the core tradeoff in vector search: speed versus recall. Every algorithm we will look at sacrifices perfect recall in exchange for sub-linear query time. The question is always how much accuracy you are willing to lose and how fast you need the answer.
HNSW: The Dominant Algorithm
Hierarchical Navigable Small World graphs, introduced by Malkov and Yashunin in 2018, have become the default indexing strategy for most vector databases. If you are using Pinecone, Weaviate, Qdrant, or Chroma under the hood, you are almost certainly running HNSW. Understanding why requires a short detour into graph theory.
The Small World Property
In the 1960s, Stanley Milgram ran his famous "six degrees of separation" experiments, demonstrating that any two people in the United States could be connected through roughly six intermediary acquaintances. The social network, despite containing hundreds of millions of nodes, had a remarkably short average path length. Mathematicians later formalized this as the "small world" property: a graph where most nodes are not directly connected, yet any node can be reached from any other through a small number of hops.
HNSW exploits this property for vector search. The idea is to build a graph where each vector is a node, and edges connect vectors that are relatively close in the embedding space. To find the nearest neighbor of a query, you do not scan the entire collection. Instead, you start at an entry point and greedily walk the graph, always moving to the neighbor closest to your query. If the graph has the small world property, you reach the neighborhood of the true nearest neighbor in O(log n) steps.
The Layered Structure
A plain small world graph works, but it can get trapped in local minima. HNSW solves this with a hierarchy of layers, and the analogy to urban navigation is almost exact.
Imagine you are trying to reach a specific coffee shop in an unfamiliar city. You would not start by walking down residential side streets at random. You would take the highway to the right part of the metro area, exit onto an arterial road to reach the right neighborhood, then navigate local streets to the specific block. Each transition narrows your search area dramatically.
HNSW works the same way. The top layer is sparse, containing only a few nodes with long-range connections (the highways). Each subsequent layer adds more nodes with shorter-range connections. The bottom layer, Layer 0, contains every vector in the collection with connections only to nearby neighbors (the side streets).
When a query arrives, the algorithm starts at the top layer's entry point and greedily navigates to the closest node it can find. It then drops down one layer, using that node as the starting point in the denser graph below. This continues layer by layer until it reaches the bottom, where it performs a final local search among the densest set of connections.
The assignment of nodes to layers is probabilistic. Each node always appears in Layer 0, but it gets promoted to higher layers with exponentially decreasing probability. This creates the right distribution: very few nodes in the top layer (providing coarse, long-range navigation) and all nodes in the bottom layer (providing fine-grained accuracy).
Key Parameters
Two parameters control HNSW behavior in practice:
Mis the maximum number of connections per node. Higher values mean better recall but more memory and slower index construction. Typical values range from 16 to 64.ef_constructioncontrols how many candidates the algorithm considers when building the graph. Higher values produce a better graph but take longer to build. Values of 100 to 200 are common.ef_search(sometimes calledef) controls how many candidates are considered at query time. This is the knob you turn to trade latency for recall. Anef_searchof 50 is fast but might miss some results; 200 is slower but more thorough.
HNSW achieves recall rates above 0.95 at query times measured in single-digit milliseconds for collections of one million vectors. The tradeoff is memory: the graph structure itself consumes significant RAM on top of the raw vectors. For a million 1,536-dimensional float32 vectors, the raw data is about 6 GB. The HNSW index can add another 2 to 4 GB depending on M.
That memory cost is why HNSW is not always the right choice.
IVF: Partitioning the Search Space
Inverted File Index takes a fundamentally different approach. Instead of building a graph, IVF partitions the entire vector space into regions and only searches the regions most likely to contain the answer.
Voronoi Cells
The construction phase runs k-means clustering on the full dataset, producing a set of centroid vectors. Each data vector is assigned to its nearest centroid, forming what mathematicians call Voronoi cells: regions of space where every point is closer to that cell's centroid than to any other centroid. If you have 1,024 centroids, your vector space is carved into 1,024 regions, each containing a roughly equal share of the data.
At query time, the algorithm compares the query vector against all centroids (a much smaller set, typically 256 to 16,384), identifies the closest centroids, and then performs an exhaustive search only within those cells. If your query falls near the center of a cell, searching just that one cell is probably sufficient. If it falls near a boundary, you need to search adjacent cells too.
The nprobe Parameter
This is where nprobe comes in. It controls how many cells to search at query time. With nprobe=1, you only search the single closest cell, which is fast but risky; your true nearest neighbor might be just across the boundary in an adjacent cell. With nprobe=10, you search the ten closest cells, dramatically improving recall at the cost of searching ten times more vectors.
The relationship between nprobe and recall is not linear. Going from nprobe=1 to nprobe=5 might jump your recall from 0.70 to 0.92. Going from 5 to 20 might only push it from 0.92 to 0.98. The diminishing returns are steep, and tuning this parameter is one of the most important decisions in an IVF deployment.
IVF has a significant advantage over HNSW: the index is much smaller. The only additional data stored beyond the raw vectors is the set of centroids and the cell assignments. For memory-constrained environments, this matters.
Product Quantization: Compressing Vectors
Both HNSW and IVF assume you can fit all your vectors in memory. At scale, this assumption breaks. A billion 1,536-dimensional float32 vectors consume roughly 6 terabytes of RAM. Product Quantization (PQ), introduced by Jegou, Douze, and Schmid in 2011, addresses this by compressing vectors to a fraction of their original size.
How PQ Works
The core idea is surprisingly elegant. Take a 1,536-dimensional vector and split it into, say, 192 sub-vectors of 8 dimensions each. For each of these 192 sub-spaces, run k-means clustering independently to learn 256 representative centroids (called a codebook). Now, instead of storing the original 8-dimensional sub-vector, you store just the index of its nearest centroid, which is a single byte.
Your original vector of 1,536 float32 values (6,144 bytes) is now represented by 192 bytes. That is a 32x compression ratio. A billion vectors that required 6 TB now fit in roughly 192 GB.
Distance computation also gets faster. Instead of computing the full distance between two 1,536-dimensional vectors, you precompute a lookup table of distances between the query's sub-vectors and all codebook centroids. The approximate distance between the query and any compressed vector is then a sum of 192 table lookups. This is substantially cheaper than 1,536 floating-point multiplications.
The Accuracy Cost
Compression always loses information. PQ introduces quantization error because each sub-vector is snapped to its nearest codebook centroid. Two vectors that were slightly different in the original space might map to the same compressed representation, making them indistinguishable to the search algorithm.
In practice, PQ alone achieves recall rates of 0.80 to 0.90 on standard benchmarks. This sounds mediocre, but PQ is rarely used alone. The standard approach is IVF+PQ: use IVF to narrow the search to a few cells, then use PQ for fast approximate distance computation within those cells. The combination maintains reasonable recall (0.90 to 0.95 with proper tuning) while enabling datasets of a billion vectors on a single machine.
Facebook's FAISS library, described by Johnson, Douze, and Jegou in 2019, made this combination accessible through a well-optimized C++ implementation with Python bindings. FAISS remains the reference implementation that most vector databases either use directly or reimagine.
Choosing the Right Index
The decision of which algorithm to use is primarily a function of dataset size, memory budget, and accuracy requirements. Here is a practical guide.
| Scale | Recommended Index | Why |
|---|---|---|
| Under 100K vectors | Flat (brute-force) | Perfect recall, sub-millisecond latency. No reason to introduce approximation at this scale. |
| 100K to 1M vectors | HNSW | Excellent recall (>0.95), single-digit millisecond queries. Memory overhead is manageable. |
| 1M to 10M vectors | HNSW or IVF | HNSW if you have the RAM. IVF if memory is constrained. Both deliver strong recall at this range. |
| 10M to 100M vectors | IVF+PQ | HNSW memory costs become prohibitive. IVF+PQ trades some recall for dramatically lower memory usage. |
| 100M+ vectors | IVF+PQ or specialized (ScaNN, DiskANN) | At this scale, you are likely using disk-based indices or distributed systems. FAISS IVF+PQ on GPU is one path; Google's ScaNN or Microsoft's DiskANN are others. |
A common mistake is reaching for HNSW at every scale. For a prototype with 5,000 document chunks, a flat index is simpler, faster to build, and returns perfect results. For a production system with 50 million product embeddings, HNSW's memory requirements might force you toward IVF+PQ even if you would prefer the higher recall.
Start simple. Complicate when the data demands it.
The Database Landscape
A vector index is not a vector database. FAISS, Annoy, and ScaNN are libraries that implement indexing algorithms. A vector database wraps those algorithms with persistence, CRUD operations, metadata storage, filtering, replication, and an API layer. The distinction matters because choosing a database is as much about operational concerns as algorithmic ones.
Pinecone
Pinecone is a fully managed service. You never provision servers, configure indices, or manage replication. You send vectors through an API and query them back. The indexing algorithm is opaque; Pinecone chooses and tunes it based on your data profile. For teams that want vector search without infrastructure overhead, Pinecone is the obvious choice. The tradeoffs are cost (managed services are never cheap at scale), vendor lock-in, and limited control over indexing parameters.
Chroma
Chroma occupies the opposite end of the spectrum. It runs in-process as a Python library, storing data locally. For development, prototyping, and small-scale deployments, it is hard to beat. You can have a working vector store in four lines of code. Chroma uses HNSW internally (via the hnswlib library) and provides a clean API for adding documents with metadata. The limitation is scale: Chroma is designed for single-machine workloads, and its persistence layer is straightforward rather than battle-tested.
Weaviate
Weaviate's distinguishing feature is hybrid search. It combines vector similarity with traditional keyword search (BM25) in a single query, letting you blend semantic and lexical matching. For RAG applications where you want to catch both semantically similar and keyword-matching documents, this is genuinely useful. Weaviate also supports custom modules for vectorization, meaning you can configure it to embed text automatically on ingest. It runs as a standalone service, self-hosted or through their managed cloud.
Qdrant
Qdrant is written in Rust and emphasizes performance and filtering capabilities. Its payload filtering system is particularly strong, supporting complex filter conditions that execute efficiently alongside vector search. If your use case involves heavy metadata filtering (say, searching product embeddings filtered by category, price range, and availability), Qdrant handles this well. It offers both a managed cloud and self-hosted deployment.
Milvus
Milvus is the heavyweight option, designed for billion-scale deployments. It supports multiple index types (HNSW, IVF, IVF+PQ, DiskANN), GPU acceleration, and distributed deployment across multiple nodes. The operational complexity is real, involving multiple microservices, an etcd dependency for coordination, and MinIO or S3 for object storage. If your problem truly requires billion-vector scale with high availability, Milvus is one of the few open-source options that handles it. For anything smaller, the operational burden is hard to justify.
pgvector
The pragmatist's choice. pgvector is a PostgreSQL extension that adds vector similarity search to the database you are probably already running. It supports both flat and HNSW indices, integrates with standard SQL queries, and benefits from PostgreSQL's mature ecosystem of tooling, backups, and replication. The performance is not competitive with purpose-built vector databases at large scale, but for applications under a few million vectors where you want to avoid adding another database to your infrastructure, pgvector is compelling. Your vectors live alongside your relational data in a single, well-understood system.
Practical Comparison
| Database | Best For | Index Types | Deployment |
|---|---|---|---|
| Pinecone | Teams wanting zero infra management | Proprietary (auto-tuned) | Managed cloud only |
| Chroma | Prototyping, small-scale, local dev | HNSW | In-process / local server |
| Weaviate | Hybrid search (vector + keyword) | HNSW | Self-hosted / managed cloud |
| Qdrant | Complex metadata filtering | HNSW | Self-hosted / managed cloud |
| Milvus | Billion-scale, GPU acceleration | HNSW, IVF, IVF+PQ, DiskANN | Distributed / managed cloud |
| pgvector | Already using PostgreSQL, moderate scale | Flat, HNSW, IVF | PostgreSQL extension |
There is no universally correct choice. But there is a useful heuristic: if you are building a prototype or course project, use Chroma. If you are building a production application and already run PostgreSQL, try pgvector first. If you need managed infrastructure at scale, evaluate Pinecone. Only reach for Milvus or a distributed setup when your data genuinely demands it.
Metadata Filtering and Why It Matters for RAG
In a real RAG system, you rarely want to search your entire vector collection. You want to search vectors that match certain conditions: documents from a specific user, chunks from a particular source, records created after a certain date. This is metadata filtering, and how a vector database implements it has significant performance implications.
Pre-filtering vs. Post-filtering
There are two strategies. Post-filtering runs the vector search first to retrieve the top candidates, then applies metadata filters to remove non-matching results. This is simple to implement but has an obvious flaw: if most of your top-K vector matches do not satisfy the filter, you end up with far fewer results than requested. You asked for 10 documents from "user_42" but the nearest 100 vectors all belong to other users.
Pre-filtering applies the metadata filter first, restricting the vector search to only those vectors that match the conditions. This guarantees you get the requested number of results (assuming enough matching vectors exist), but it complicates the index. The vector search algorithm now operates on a dynamic subset of the data, which can interact poorly with precomputed index structures like HNSW graphs or IVF cell assignments.
Most modern vector databases use a hybrid approach. Qdrant and Weaviate implement what is sometimes called "filtered HNSW," where metadata conditions are checked during graph traversal. A node is only considered a candidate if it passes the filter. This avoids the pitfalls of pure post-filtering while keeping the efficient graph traversal structure.
For RAG applications, pre-filtering is almost always what you want. A common pattern is to store metadata like source, user_id, document_type, and created_at alongside each vector, then filter on these fields at query time. If your vector database handles this poorly, your RAG system will return irrelevant or insufficient results regardless of how good your embeddings are.
The index is only half the story. The filter is the other half.
Code: Working with Vector Databases
Let us ground all of this in actual code. The following examples demonstrate basic operations using Chroma (for local development) and show how the concepts we have discussed map to API calls.
Setting Up Chroma
↗ docs# Install: pip install chromadb import chromadb from chromadb.utils import embedding_functions # Create a persistent client (data survives restarts) client = chromadb.PersistentClient(path="./chroma_data") # Use the default embedding function (all-MiniLM-L6-v2) # or bring your own from OpenAI, Cohere, etc. embedding_fn = embedding_functions.DefaultEmbeddingFunction() # Create or get a collection collection = client.get_or_create_collection( name="course_documents", embedding_function=embedding_fn, metadata={"hnsw:M": 32, "hnsw:ef_construction": 128} )
Notice the metadata parameter on the collection. This is where you configure the HNSW index parameters we discussed earlier. An M of 32 means each node in the graph maintains up to 32 connections. The ef_construction of 128 controls the quality of the graph during index building.
Adding Documents with Metadata
↗ docs# Add documents with metadata for filtering collection.add( documents=[ "Vector databases use approximate nearest neighbor algorithms.", "HNSW builds a navigable small-world graph for fast search.", "Product quantization compresses vectors to reduce memory.", "IVF partitions the vector space into Voronoi cells.", "Brute-force search compares every vector but does not scale.", ], metadatas=[ {"topic": "overview", "week": 4}, {"topic": "hnsw", "week": 4}, {"topic": "quantization", "week": 4}, {"topic": "ivf", "week": 4}, {"topic": "baseline", "week": 4}, ], ids=["doc_1", "doc_2", "doc_3", "doc_4", "doc_5"] )
Each document gets an embedding generated automatically by the embedding function. The metadata fields (topic, week) are stored alongside the vector and can be used for filtering at query time.
Querying with Metadata Filters
↗ docs# Simple semantic search results = collection.query( query_texts=["How do graph-based indices work?"], n_results=3 ) for doc, dist in zip(results["documents"][0], results["distances"][0]): print(f" [{dist:.4f}] {doc}") # Output: # [0.3821] HNSW builds a navigable small-world graph for fast search. # [0.7234] Vector databases use approximate nearest neighbor algorithms. # [0.8901] Brute-force search compares every vector but does not scale.
↗ docs# Filtered search: only search documents about specific topics filtered_results = collection.query( query_texts=["How can I reduce memory usage?"], n_results=2, where={"topic": {"$in": ["quantization", "ivf"]}} ) for doc in filtered_results["documents"][0]: print(f" {doc}") # Output: # Product quantization compresses vectors to reduce memory. # IVF partitions the vector space into Voronoi cells.
The where clause applies a pre-filter before vector search, ensuring results come only from documents matching the metadata condition. This is the filtered search pattern critical for multi-tenant RAG systems.
Using FAISS Directly
For more control over the indexing algorithm, you can use FAISS directly. This example builds and queries an IVF+PQ index.
↗ docsimport numpy as np import faiss # Generate sample data: 100,000 vectors of dimension 128 np.random.seed(42) dimension = 128 n_vectors = 100_000 data = np.random.randn(n_vectors, dimension).astype("float32") # Build a flat index (brute-force baseline) flat_index = faiss.IndexFlatL2(dimension) flat_index.add(data) # Build an IVF+PQ index # 256 Voronoi cells, vectors split into 16 sub-quantizers, 8 bits each n_cells = 256 n_subquantizers = 16 n_bits = 8 quantizer = faiss.IndexFlatL2(dimension) ivfpq_index = faiss.IndexIVFPQ(quantizer, dimension, n_cells, n_subquantizers, n_bits) # IVF+PQ requires training on representative data ivfpq_index.train(data) ivfpq_index.add(data) # Set nprobe: how many cells to search ivfpq_index.nprobe = 10 # Query query = np.random.randn(1, dimension).astype("float32") # Compare results flat_distances, flat_ids = flat_index.search(query, k=5) ivfpq_distances, ivfpq_ids = ivfpq_index.search(query, k=5) print("Flat (exact):", flat_ids[0]) print("IVF+PQ (approx):", ivfpq_ids[0]) # Measure recall: how many of IVF+PQ's results are in flat's results? recall = len(set(flat_ids[0]) & set(ivfpq_ids[0])) / 5 print(f"Recall@5: {recall:.2f}")
Notice the train step. Unlike HNSW (which builds incrementally), IVF+PQ needs to learn its Voronoi centroids and PQ codebooks from a representative sample of data before you can add vectors. This is a one-time cost, but it means IVF+PQ is less suitable for collections that change rapidly.
Measuring the Speed Difference
import time # Benchmark: 1000 queries against each index queries = np.random.randn(1000, dimension).astype("float32") start = time.perf_counter() flat_index.search(queries, k=10) flat_time = time.perf_counter() - start start = time.perf_counter() ivfpq_index.search(queries, k=10) ivfpq_time = time.perf_counter() - start print(f"Flat: {flat_time:.3f}s for 1000 queries") print(f"IVF+PQ: {ivfpq_time:.3f}s for 1000 queries") print(f"Speedup: {flat_time / ivfpq_time:.1f}x") # Typical output at 100K vectors: # Flat: 0.842s for 1000 queries # IVF+PQ: 0.031s for 1000 queries # Speedup: 27.2x
A 27x speedup at 100K vectors. At 10 million vectors, the gap grows to several hundred times. That is the difference between a usable application and one that times out.
Putting It Together for RAG
In the context of a RAG pipeline, the vector database sits between the embedding model and the language model. Documents are chunked, embedded, and stored during ingestion. At query time, the user's question is embedded, the vector database returns the most relevant chunks, and those chunks are passed to the language model as context.
The choice of indexing algorithm affects three things that matter to end users: latency (how long until they see a response), relevance (whether the retrieved context actually answers their question), and cost (how much infrastructure you need to run).
For a course project or prototype with a few thousand documents, a flat index through Chroma gives you perfect recall with negligible latency. For a production system with millions of documents, you will likely need HNSW through a managed service like Pinecone or a self-hosted solution like Qdrant, with careful attention to metadata filtering for multi-tenant isolation.
The algorithms are the foundation. But the engineering decisions around chunking strategy, embedding model selection, metadata schema design, and filter architecture are what determine whether a RAG system works well in practice. The vector database is necessary infrastructure, not sufficient infrastructure.
Get the index right, then focus on everything around it.
References
Textbook grounding and extended commentary: Sources.
- Malkov, Y. A. & Yashunin, D. A. (2018). "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs." IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(4), 824-836.
- Jegou, H., Douze, M. & Schmid, C. (2011). "Product Quantization for Nearest Neighbor Search." IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1), 117-128.
- Johnson, J., Douze, M. & Jegou, H. (2019). "Billion-scale similarity search with GPUs." IEEE Transactions on Big Data, 7(3), 535-547.
- Milgram, S. (1967). "The Small World Problem." Psychology Today, 2(1), 60-67.
- Watts, D. J. & Strogatz, S. H. (1998). "Collective dynamics of 'small-world' networks." Nature, 393(6684), 440-443.
- Chroma. (2024). "Chroma Documentation." trychroma.com.
- FAISS. (2024). "FAISS: A library for efficient similarity search." Facebook Research, GitHub.