GraphRAG: When the Index Is a Graph

GraphRAG is not vector RAG with more sophistication. The index is a typed knowledge graph extracted by a language model at ingest, and retrieval is graph traversal or community-summary aggregation rather than similarity search. Indexing is expensive in language-model calls; retrieval is richer for global queries that span the whole corpus. The decision is about workload shape, not retrieval quality.

What "Graph" Means Here

A knowledge graph, in the sense GraphRAG uses the term, is a collection of labeled entities (nodes) joined by typed relationships (edges), with both nodes and edges carrying properties such as descriptions, source-document references, and aggregated importance scores. The structural intuition is that nodes connected by edges encode heterogeneous and relational information in a form that flat text retrieval cannot represent directly.² A GraphRAG system can be decomposed into a small number of components: a query processor, a retriever, an organizer, a generator, and a data source. The graph itself is the data source, not the retriever, which sharpens the architectural framing: the graph is structurally upstream of retrieval, not an alternative similarity function bolted onto an otherwise-vector pipeline.³
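
As a minimal sketch of the data structure described above, the following Python models entities and relationships as small records carrying descriptions, source references, and a score. The class and field names (Entity, Relationship, KnowledgeGraph, source_ids, importance) are illustrative assumptions, not the schema of any particular GraphRAG implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A node: a labeled entity with descriptive properties."""
    name: str
    type: str
    description: str = ""
    source_ids: list[str] = field(default_factory=list)  # chunks that mention it
    importance: float = 0.0                               # aggregated score


@dataclass
class Relationship:
    """An edge: a typed, directed link between two entities."""
    source: str
    target: str
    type: str
    description: str = ""
    source_ids: list[str] = field(default_factory=list)
    weight: float = 1.0


@dataclass
class KnowledgeGraph:
    entities: dict[str, Entity] = field(default_factory=dict)
    relationships: list[Relationship] = field(default_factory=list)

    def add_entity(self, entity: Entity) -> None:
        self.entities[entity.name] = entity

    def add_relationship(self, rel: Relationship) -> None:
        # Reject edges whose endpoints are not nodes in the graph.
        if rel.source not in self.entities or rel.target not in self.entities:
            raise KeyError("both endpoints must be added before the edge")
        self.relationships.append(rel)

    def neighbors(self, name: str) -> set[str]:
        """Entities one hop away, ignoring edge direction."""
        out = set()
        for rel in self.relationships:
            if rel.source == name:
                out.add(rel.target)
            elif rel.target == name:
                out.add(rel.source)
        return out
```

Keeping source_ids on both nodes and edges is what lets a retriever trace any graph fact back to the chunks that produced it.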

The vector-RAG model the previous article in this series walked through is built on a single idea: similarity in a high-dimensional embedding space is a good-enough proxy for relevance. The pipeline chunks documents, embeds each chunk into a fixed-dimensional vector, indexes the vectors for approximate-nearest-neighbor search, and at query time returns the top-k chunks whose embedding is closest to the query embedding. The retrieval primitive is "find the points near this point." GraphRAG replaces that primitive with something different: compose an answer by walking the structure between concepts, or, in the global-query case, aggregate pre-computed summaries of the regions of the structure most relevant to this question. The pipeline still ends at a language model receiving augmented context, which keeps GraphRAG inside the RAG definition, but the index and the retrieval step are different objects from the ones in a vector pipeline.⁶
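
For contrast, the "find the points near this point" primitive can be sketched in a few lines. This is an exact brute-force search over toy vectors, assuming the embeddings already exist. A production system would use an approximate-nearest-neighbor index instead, and the function names here are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=3):
    """Exact nearest-neighbor search: the k chunk ids most similar to the query."""
    scored = sorted(chunk_vecs.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in scored[:k]]
```

Nothing in this primitive knows that two chunks mention the same entity, which is the gap the graph index is meant to fill.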

The architectural claim is that GraphRAG indexes structure where vector RAG indexes text.¹ A vector index does not know that the documents it stores reference recurring entities, and it does not know which relationships hold between those entities. A graph index encodes both, and its retrieval primitives can use both. That is the load-bearing structural difference.

Three failure modes of flat text retrieval motivate the graph alternative: complex query understanding in professional contexts, knowledge integration across distributed sources, and system efficiency bottlenecks at scale.⁴ The first two are failures a graph can address by encoding relationships across documents. The third is the cost lever the indexing pipeline pulls on: pre-computing structure at ingest amortizes work that flat retrieval would otherwise repeat per query. GraphRAG front-loads the LLM calls into ingest, where they amortize across thousands of future queries. Vector RAG spreads that same compute across per-query inference, where it never amortizes. The math favors graphs at high query volume and punishes them at low query volume.
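
The amortization argument reduces to a break-even calculation. The sketch below assumes three inputs a team would have to measure for itself (the one-time indexing cost, the per-query cost with the graph index, and the per-query cost of the baseline it replaces), all in the same unit. The function name and parameters are hypothetical.

```python
import math


def break_even_queries(index_cost: float,
                       graph_query_cost: float,
                       baseline_query_cost: float) -> float:
    """Number of queries after which the up-front index cost is repaid.

    Returns math.inf when the graph path saves nothing per query, in which
    case the indexing cost never amortizes.
    """
    saving_per_query = baseline_query_cost - graph_query_cost
    if saving_per_query <= 0:
        return math.inf
    return index_cost / saving_per_query
```

With a one-time cost of 1,000 units and a saving of 2 units per query, the index pays for itself after 500 queries. If the graph path is more expensive per query than the baseline, it never does.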

. . .

The GraphRAG Anchor Paper

The reference point for the current GraphRAG conversation is a Microsoft Research paper from 2024, revised in 2025, that does three things at once: names a failure mode of vector RAG, proposes a two-stage indexing architecture that addresses the failure mode, and reports a concrete cost number that has framed every conversation about GraphRAG economics since.⁵

The failure mode is what the paper calls global queries: questions that ask something about the entire corpus rather than about a specific passage. "What are the main themes in this dataset?" is the canonical example. A top-k vector retrieval cannot answer this kind of question, because no single chunk contains the answer; the answer is a property of the corpus considered as a whole. Baseline vector RAG also struggles when answering a question requires traversing information spread across many documents, and when the question demands a synthesis over the whole collection rather than a passage retrieval.⁶

The proposed architecture splits indexing into two stages. Stage one is entity-and-relationship extraction, run chunk by chunk over the corpus with a language model in the loop. The result is a graph whose nodes are the entities the model surfaced and whose edges are the relationships it surfaced between them. Stage two runs community detection over that graph, partitioning the nodes into clusters of closely connected entities, and then pre-generates a natural-language summary of each community using another language-model call. The output of the indexing pipeline is therefore not just a graph; it is a graph plus a hierarchy of pre-written community-level summaries that the query pipeline will consume.

The query pipeline composes an answer rather than retrieving a chunk. Given a question, each community summary is used to generate a partial response, and all the partial responses are summarized again into a final response to the user. The shape is map-reduce: map the question over every community summary, reduce the partial answers into one. Microsoft taxonomizes this as Global Search, and distinguishes it from three other modes built on the same index: Local Search (reasoning about a specific entity by fanning out to its neighbors), DRIFT Search (local reasoning with added community context), and Basic Search (a fall-back to baseline RAG when vector retrieval is genuinely the right primitive).¹ Each mode consumes a different slice of the same indexed structure.

The cost number that has framed every conversation since is also from this paper: graph indexing with a 600-token window took 281 minutes for a single ~1M-token corpus. A vector-RAG index over the same corpus would take seconds to minutes on the same hardware, because each chunk requires a single forward pass through an embedding model rather than a multi-step language-model extraction. The compensating efficiency arrives at query time: the community-summary stage produces a representation that requires 9 to 43 times fewer tokens per query than reading source text directly to summarize. The indexing investment shifts work into the index so the query becomes cheaper, but only on the queries the index was built to serve.

The paper's authors flag their own generalization limit directly: the evaluation focused on sensemaking questions specific to two corpora of roughly 1 million tokens each, and more work is needed to understand how performance generalizes to datasets from various domains. The strongest GraphRAG result in the literature is defined on two corpora and one workload type, and the comparative benchmarks discussed later in this article exist to probe how well the architecture transfers beyond that setting.

. . .

The Indexing Pipeline

The canonical pipeline has four stages: slice the input corpus into TextUnits that act as analyzable units, extract all entities and relationships and key claims from those units, perform a hierarchical clustering of the resulting graph using the Leiden technique, and generate summaries of each community from the bottom up. The extraction and summarization stages are language-model operations (chunking and clustering are not), and the cost of running them at scale is the central economic fact about the architecture. This section walks the stages in the order a document encounters them.

Figure: Four-stage indexing pipeline (chunking, extraction, Leiden clustering, community summarization).

Chunking is shared territory with vector RAG, and the considerations are largely the same: too small and entities get fragmented across boundaries, too large and the extraction prompt loses signal-to-noise. The 600-token window used in the published Microsoft benchmark is a reasonable starting point, though the right value is corpus-specific. The chunks become the input to the entity-extraction step, which is where GraphRAG diverges from anything a vector pipeline does.
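
A fixed-window chunker is short enough to sketch. This version assumes whitespace-delimited words as a stand-in for model tokens, and the 100-token overlap is an illustrative default rather than a value taken from the text above. A real pipeline would count tokens with the tokenizer of the model it calls.

```python
def chunk_tokens(text: str, window: int = 600, overlap: int = 100) -> list[str]:
    """Split text into overlapping windows of whitespace-delimited tokens."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    tokens = text.split()
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap is a hedge against the fragmentation problem: an entity mention that straddles a boundary still appears whole in at least one chunk.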

Entity extraction is implemented as a structured language-model call. The model is prompted to enumerate the important entities in a chunk along with their types and descriptions, then to enumerate the relationships between those entities along with the relationship descriptions. The output is parsed into a candidate-graph fragment and merged with the fragments extracted from every other chunk in the corpus. In the Microsoft reference implementation this is encapsulated by a single module at packages/graphrag/graphrag/index/operations/extract_graph/extract_graph.py, which calls into an inner strategy named for "graph intelligence entity extraction."⁸

The dependence on prompt engineering at this step is the architecture's single most consequential operational fact. The quality of the graph is bounded above by the quality of the extraction prompt, and getting that prompt right is part of the per-domain configuration burden.⁹ A medical corpus, a legal corpus, and a financial corpus each want different entity types and different relationship vocabularies, and the prompt is where those expectations are encoded. Microsoft ships default prompts oriented around generic types like people, places, organizations, and things; specialized domains need their own. The same observation surfaces in the LangChain integration pattern, where an LLMGraphTransformer automates graph construction by analyzing text and suggesting an entity/relationship structure, but the suggestion is only as good as the language model and the schema it is given to work with.¹⁴
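
One way to picture the per-domain configuration burden is a prompt template parameterized by the entity and relationship vocabularies. The sketch below is a hypothetical template, not the default prompt shipped by Microsoft or any other project, and the type lists in DOMAIN_SCHEMAS are invented for illustration.

```python
def build_extraction_prompt(entity_types, relationship_types, chunk):
    """Render a domain-specific extraction prompt for one chunk."""
    return (
        "Extract entities and relationships from the text below.\n"
        f"Allowed entity types: {', '.join(entity_types)}\n"
        f"Allowed relationship types: {', '.join(relationship_types)}\n"
        "For each entity give name, type, and a one-sentence description.\n"
        "For each relationship give source, target, type, and description.\n"
        "Text:\n"
        f"{chunk}"
    )


# Hypothetical per-domain vocabularies: (entity types, relationship types).
DOMAIN_SCHEMAS = {
    "generic": (["person", "place", "organization"], ["related_to"]),
    "medical": (["drug", "condition", "procedure"], ["treats", "contraindicates"]),
}
```

Switching domains then means swapping the schema entry rather than rewriting the pipeline, although in practice the instructions and few-shot examples usually need domain tuning as well.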

Once each chunk's contribution is extracted, the graph-construction step merges the fragments into a single corpus-level graph. Entities that the language model surfaced under different surface forms (acronyms, case variants, partial mentions) need to be reconciled to the same node, and relationships need to be deduplicated and aggregated. The Microsoft pipeline runs a series of normalization passes here, but the underlying problem is the classic entity-resolution problem from information extraction, and it does not have a clean general solution. A graph constructed without careful resolution will be sparser than it should be, with the same entity appearing as several disconnected nodes, and the community-detection step downstream will partition it incorrectly. In production deployments the resolution layer is often where the most engineering goes.
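
The merge step can be illustrated with a deliberately naive resolver: normalize surface forms, apply a hand-curated alias map, and aggregate descriptions and edge counts under the resulting key. This is a sketch of the general idea, not the normalization the Microsoft pipeline performs. The function names and the alias map are assumptions.

```python
import re
from collections import defaultdict


def normalize(name: str) -> str:
    """Canonical key: casefolded, punctuation stripped, whitespace collapsed."""
    cleaned = re.sub(r"[^\w\s]", "", name.casefold())
    return " ".join(cleaned.split())


def merge_fragments(fragments, aliases=None):
    """Merge per-chunk (entities, edges) fragments into one graph.

    fragments: iterable of (entities, edges), where entities is a list of
    (name, description) pairs and edges is a list of (source, target) pairs.
    aliases: optional hand-curated map from a normalized surface form
    (such as an acronym) to the normalized canonical form.
    """
    aliases = aliases or {}

    def key(name: str) -> str:
        k = normalize(name)
        return aliases.get(k, k)

    descriptions = defaultdict(list)   # node key -> descriptions seen
    edge_weights = defaultdict(int)    # (key, key) -> number of mentions
    for entities, edges in fragments:
        for name, description in entities:
            descriptions[key(name)].append(description)
        for source, target in edges:
            edge_weights[(key(source), key(target))] += 1
    return dict(descriptions), dict(edge_weights)
```

String normalization handles case variants and punctuation, and the alias map handles acronyms. Partial mentions and genuinely ambiguous names need more than this, which is why the resolution layer absorbs so much engineering.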

Community detection turns the graph into a hierarchy of regions. In the Microsoft implementation it is invoked by another single module at packages/graphrag/graphrag/index/operations/cluster_graph.py, which imports hierarchical_leiden and exposes a cluster_graph(edges, max_cluster_size, use_lcc, seed) function that returns a Communities object.⁸ The output is a partition of the graph's nodes into clusters, computed at multiple levels of granularity so that small focused communities and large thematic communities are both available to the query pipeline. The next section discusses why Leiden, specifically, is the algorithm Microsoft chose and why the choice matters.

The final stage is summary generation. Each community in the hierarchy gets a language-model-generated natural-language summary, written from the community's nodes, edges, and the source-text claims that produced them. The summaries are the artifact that makes global-query retrieval cheap at run time: instead of reading the corpus to answer a thematic question, the system reads the pre-written summaries of every community whose theme intersects the question. The work has been done at index time, and the query pays for the aggregation rather than the reading.

The Microsoft repository's README carries an explicit warning the maintainers chose to put in front of every user, advising new adopters that indexing is expensive and that the right way to start is small.⁹ The expensive indexing is the price of admission, and the rest of the system is designed to make the price worth paying for the right workloads.

. . .

The Retrieval Mechanism

The retrieval side of GraphRAG is where the global/local distinction starts to feel less like a marketing framing and more like an architectural reality. The two retrieval modes use the indexed structure in genuinely different ways, and the right way to think about them is as two different retrievers behind one common index.

The local-retrieval mode treats the graph as a neighborhood-traversal substrate. A question like "what does the corpus say about Entity X?" first resolves the entities the question mentions to nodes in the graph, then fans out to the neighbors of those nodes along the relationships the question implies. The retrieved context is the entity's neighborhood: its directly connected entities, the relationships between them, the source-text references that produced the edges. The neighborhood is assembled into a prompt and handed to the generation model. This mode is the closest GraphRAG comes to a traditional retrieval primitive, and on simple local questions it is genuinely competitive with vector RAG, in part because the extraction step is itself a form of compression of the source text into the structure relevant to the question.
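
The fan-out step can be sketched as a bounded breadth-first traversal. The adjacency format (entity to a dict of neighbor and relationship text) and the function name are assumptions chosen for brevity. Entity resolution from the question to seed nodes is assumed to have already happened.

```python
def local_context(adjacency, seeds, hops=1):
    """Collect the entities within `hops` edges of the seed entities.

    adjacency: dict mapping an entity to a dict {neighbor: relationship text}.
    seeds: entities the question was resolved to.
    Returns (entities, facts), where facts are (source, relationship, target)
    triples along the traversed edges, ready to be formatted into a prompt.
    """
    visited = set(seeds)
    frontier = list(seeds)
    facts = []
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            for neighbor, relation in adjacency.get(node, {}).items():
                facts.append((node, relation, neighbor))
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return visited, facts
```

The hops parameter is the main cost control: each extra hop can multiply the size of the assembled context.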

The global-retrieval mode is the mode that justifies the entire architecture, and it is genuinely different from anything a vector system does. A question like "what are the main themes in this dataset?" cannot be answered by any retrieval primitive that returns a fixed-size top-k from a flat index, because the answer is not in any single chunk. If the question addresses the entire dataset, all input texts should be considered, not just the top-k most similar.⁷ The global-search retriever does not return chunks. It iterates over the community summaries produced at indexing time, asks the language model to extract from each summary the parts relevant to the user's question, and then asks the language model again to reduce the partial extractions into a single coherent answer. The shape is map-reduce, applied to a corpus of pre-written summaries rather than to the raw text.
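
The map-reduce shape is easy to see in code. In this sketch the two language-model calls are passed in as plain callables (map_fn and reduce_fn are hypothetical names), so the control flow can be read and tested without committing to any model API.

```python
from typing import Callable


def global_search(question: str,
                  community_summaries: list[str],
                  map_fn: Callable[[str, str], str],
                  reduce_fn: Callable[[str, list[str]], str]) -> str:
    """Map the question over every summary, then reduce the partial answers."""
    partials = []
    for summary in community_summaries:
        partial = map_fn(question, summary)   # one LLM call per community
        if partial:                           # drop summaries with nothing relevant
            partials.append(partial)
    return reduce_fn(question, partials)      # one final LLM call
```

The loop makes the cost structure explicit: the number of model calls per query grows with the number of community summaries consulted, plus one for the reduction.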

The hierarchical nature of the community structure gives the global retriever an additional knob: which level of the hierarchy to operate on. Community summaries exist at multiple levels (root-level C0, then C1, C2, C3 in increasingly fine partitions). Broader levels require dramatically fewer tokens than finer levels, at the cost of less detail. The 9 to 43 times token-reduction figure cited earlier is specifically the C0-versus-source-text comparison. A system tuning for cost can operate at higher levels of the hierarchy; a system tuning for detail can operate at lower levels. The hierarchy is the lever.

DRIFT Search is the hybrid between local and global. It uses community information to set context for a local question, threading the global theme through the local neighborhood traversal. The mode exists to handle questions that are neither pure-local ("tell me about Entity X") nor pure-global ("what are the themes here?") but a mixture ("how does Entity X relate to the broader theme of Y across the corpus?"). The fourth mode, Basic Search, is the explicit acknowledgment that GraphRAG does not subsume vector RAG: it falls back to baseline RAG for questions where vector retrieval is genuinely the right primitive.

Other implementations take a different stance on the same problem space, treating retrieval as a choice among parallel retrievers rather than as a fixed pipeline. The Neo4j GraphRAG Python library ships nine retriever classes (VectorRetriever, VectorCypherRetriever, HybridRetriever, HybridCypherRetriever, ToolsRetriever, Text2CypherRetriever, and three vector-database integrations). The Text2CypherRetriever, for instance, asks an LLM to generate a Cypher query that fetches exactly the information needed from the graph. The retriever-pluralism is informative: the same graph index can serve as a substrate for vector similarity, traversal, hybrid retrieval, or text-to-Cypher generation, and the choice of retriever is a per-query (or per-application) decision rather than an architectural commitment.¹³

LlamaIndex's PropertyGraphIndex makes a similar pluralism explicit on both indexing and retrieval sides, exposing four parallel sub-retrievers (LLMSynonymRetriever for keyword expansion, VectorContextRetriever for similarity-then-traversal, TextToCypherRetriever, and CypherTemplateRetriever for parameter-filled templates).¹⁵ The choice of retriever is a configuration option, not a property of the architecture. The pattern is consistent across the three reference implementations the field has converged on: a graph index supports multiple retrieval primitives, and the question of which one to use is a tuning decision rather than an architectural one.

. . .

Leiden and Why It Matters

Microsoft GraphRAG uses Leiden community detection, applied hierarchically. The choice is not casual. Leiden exists specifically to fix a defect in the older Louvain algorithm that GraphRAG would have inherited if Microsoft had picked Louvain instead.¹⁰ Since the community-detection step determines which entities end up summarized together, and since the summaries are what the global-retrieval mode reads, the choice of algorithm is structurally load-bearing.

The defect in Louvain is that the algorithm can produce arbitrarily badly connected communities, including communities that are not connected at all. A disconnected community is a community in the partition that contains nodes with no path between them. The Leiden authors' characterization of Louvain is that, in practice, up to 25 percent of communities can be badly connected and up to 16 percent can be fully disconnected, especially when the algorithm is run iteratively. For most graph applications this is a cosmetic problem; for GraphRAG it is a substantive one, because a disconnected community means the summary the language model writes for that community bundles unrelated entities together as if they were a coherent thematic region. The summary then misleads every global query that reaches that community.
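
The connectivity property at issue can be checked directly. The sketch below tests whether a community is connected using only edges internal to it. It is a diagnostic for a partition produced by any algorithm, not an implementation of Louvain or Leiden, and the function name is an assumption.

```python
from collections import deque


def is_connected(community, edges):
    """True if every node in `community` can reach every other node
    using only edges whose two endpoints are both inside the community."""
    members = set(community)
    if not members:
        return True
    adjacency = {node: set() for node in members}
    for a, b in edges:
        if a in members and b in members:
            adjacency[a].add(b)
            adjacency[b].add(a)
    start = next(iter(members))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node] - seen:
            seen.add(neighbor)
            queue.append(neighbor)
    return seen == members
```

On the chain a-b-c-d, the community {a, b, d} fails the check because d is attached to the rest only through c, which sits outside the community. That is the kind of cluster a summary would misrepresent as a coherent region.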

The Leiden algorithm fixes the defect by construction. The algorithm yields communities that are guaranteed to be connected, and, when applied iteratively, converges to a partition in which all subsets of all communities are locally optimally assigned. The connected-community guarantee is what makes the algorithm appropriate for GraphRAG: every community the algorithm produces is a coherent region of the graph in the path sense, so the summary the language model writes for that community is a summary of an actual thematic region rather than an arbitrary cluster. The Microsoft implementation runs Leiden hierarchically at multiple cluster sizes and exposes the result through the hierarchical_leiden import in the cluster-graph module.

Leiden also runs faster than Louvain in practice, because it relies on a fast local-move approach. The speed difference matters at corpus scale, where community detection has to run over the full entity graph at each level of the hierarchy. The combination of correctness (connected communities by construction) and speed (faster than the alternative) is why the algorithm became the standard choice for hierarchical graph clustering in the years between the 2019 algorithmic paper and the 2024 GraphRAG paper, and why Microsoft cited it by name in the anchor publication.

The practical takeaway: the community-detection algorithm matters because the community summaries are what global-query GraphRAG reads. If the communities are arbitrary partitions, the summaries are arbitrary, and the answers to global queries inherit the arbitrariness. Leiden is the architectural decision that gives the global-query path its theoretical grounding; a Louvain-based pipeline would offer no such connectivity guarantee.

. . .

The Indexing-Cost Economics

Every conversation about whether GraphRAG is the right architecture for a given workload eventually arrives at one number: the cost of indexing. The 281-minute figure for a single ~1M-token corpus is the public anchor, and it is worth taking seriously not as a benchmark that other implementations have to hit but as evidence about the shape of the cost. The indexing pipeline runs an LLM extraction call per chunk plus LLM summary calls per community across the levels of the hierarchy. For any pipeline that uses an LLM at those stages the wall-clock cost scales roughly linearly with corpus size, and the dollar cost scales linearly with whatever the provider charges per token.
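
The linear scaling can be made concrete with a call-count estimate. This sketch assumes non-overlapping chunks, a single extraction pass per chunk, and one summary call per community. The number of communities is an input the caller must supply, since it depends on the graph, and the function name is hypothetical.

```python
import math


def estimate_index_calls(corpus_tokens: int,
                         window_tokens: int,
                         num_communities: int) -> dict[str, int]:
    """Count LLM calls: one extraction per chunk, one summary per community."""
    extraction_calls = math.ceil(corpus_tokens / window_tokens)
    return {
        "extraction_calls": extraction_calls,
        "summary_calls": num_communities,
        "total_calls": extraction_calls + num_communities,
    }
```

A 1M-token corpus at a 600-token window needs 1,667 extraction calls before any community summaries are written. Doubling the corpus doubles that term.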

The cost can be reduced substantially by trading the LLM-based extraction step for a cheaper alternative. A dependency-parsing pipeline that replaces the LLM extraction recovers about 94 percent of LLM-based performance (61.87 percent versus 65.83 percent on the same benchmark) at a fraction of the cost.¹⁹ The headline finding is not that LLM extraction is wrong but that it is over-engineered for many practical settings, and that most of the quality can be retained without the per-chunk LLM calls.

The per-query economics are equally important. The most useful published numbers come from a comparative benchmark across four systems on the same evaluation corpora, summarized in the table below.²³

Table: Per-query token cost on the Novel corpus benchmark.

The vanilla-RAG row is the baseline: roughly 900 tokens per query, which is about the size of a single chunk-and-prompt context. The MS-GraphRAG (global) row is more than two orders of magnitude larger, with the heaviest queries reaching prompt sizes around 40,000 tokens before the map-reduce structure multiplies that across communities and the final reduction step. That HippoRAG2 sits near vanilla RAG and LightRAG sits between the two is informative: not every graph-based system inherits the cost shape of MS-GraphRAG, and the per-query economics depend heavily on which retrieval strategy operates over the graph. A system that summarizes by reading every community summary will be expensive; a system that retrieves entity neighborhoods or runs PageRank over the graph need not be.

There is an inverse argument worth holding alongside the table. The same community summaries that make MS-GraphRAG global queries expensive also make them cheaper than the alternative of summarizing source text directly. Against a baseline that reads the entire corpus into a single prompt and asks for a summary, the indexing investment pays off at query time by a factor of 9 to 43. Against vanilla vector RAG on simple lookups, it does not. The right comparison depends on the workload, not on a single benchmark number.

Corpus updates are the third axis of cost. The LightRAG paper reports that in the retrieval phase MS-GraphRAG can require around 610,000 tokens and hundreds of API calls where LightRAG consumes fewer than 100 tokens with a single API call, and that on incremental updates MS-GraphRAG has to rebuild its community structure where LightRAG merges new entities and relationships into the existing graph.²⁶ The gap is not a curiosity. For workloads with frequent corpus updates, the difference between a system that re-indexes large portions of the graph on every update and a system that integrates new entities and relationships without full reconstruction is the difference between operationally viable and operationally infeasible.

The economics calculation a team has to run before adopting GraphRAG is therefore multi-variable. The variables include the size of the corpus, the rate of change in the corpus, the workload mix (global queries versus local queries versus simple lookups), the per-token cost of the language model at extraction and summarization, and the operational tolerance for re-indexing latency when the corpus updates. The architecture wins, decisively, when the corpus is large and stable and the queries are global and frequent. The architecture loses, decisively, when the corpus changes frequently and the queries are simple lookups. Most real workloads sit somewhere between those extremes, which is why a later section turns to the hybrid patterns that combine graph and vector retrieval in the same pipeline.

. . .

The Implementation Ecosystem

The Microsoft GraphRAG repository is the reference implementation, both in the sense that it was the first to ship as a coherent open-source project and in the sense that almost every subsequent system positions itself as a successor, simplification, or alternative to it. The project is a data pipeline and transformation suite that extracts structured data from unstructured text using LLMs, ships under an MIT license, and provides the full indexing pipeline along with the four query modes (Global, Local, DRIFT, Basic) discussed earlier. It is the system people are usually thinking of when they say "GraphRAG" without further qualification.

The Neo4j GraphRAG Python package is the second reference point a team is likely to encounter, positioned as the official first-party Neo4j entry into the space.¹² The Neo4j approach is different from Microsoft's in an important way: it does not commit to one indexing pipeline or one retrieval pattern. It exposes a library of retriever classes and lets the application choose how to compose them. The graph backing the retrievers is a Neo4j property graph, which integrates with the rest of the Neo4j ecosystem (Cypher query language, native graph algorithms, vector index plugin). A team that already runs Neo4j for other purposes will find this the most natural entry point.

LlamaIndex's PropertyGraphIndex is the third major implementation, and it is the one that most explicitly surfaces the choice between schema-free and schema-driven extraction. The framework defines a property graph as a collection of labeled nodes with properties, linked together by relationships into structured paths, and exposes four parallel extractors: SimpleLLMPathExtractor (LLM-extracted single-hop triples), ImplicitPathExtractor (no LLM, uses existing node.relationships attributes), DynamicLLMPathExtractor, and SchemaLLMPathExtractor (typed-schema validation). The pluggable design lets a team trade off cost (ImplicitPathExtractor uses no LLM at all) against quality (SchemaLLMPathExtractor validates against a typed schema) without changing the rest of the pipeline.

The earlier LlamaIndex KnowledgeGraphIndex API has been deprecated since version 0.10.53 in favor of PropertyGraphIndex.¹⁶ The deprecation is informative as a historical signal: the field's earlier generation of GraphRAG-adjacent systems (pre-2024) used flat (subject, predicate, object) triplets, and the current generation uses property graphs with typed nodes and edges. A triplet store is less expressive than a property graph, and the latter has become the default representation.

The LangChain / Neo4j integration is the fourth pattern worth knowing about. The integration uses an LLMGraphTransformer to automate knowledge-graph creation by analyzing text and proposing an entity-relationship structure, and frames the hybrid graph-plus-vector pattern as combining structured graph data with vector search over unstructured text.¹⁴ The next section is about that hybrid pattern as a distinct architecture.

The ecosystem is therefore not single-vendor and not even single-shape. Microsoft GraphRAG is the structured pipeline with global-search at the center. Neo4j is the database-backed graph with a library of retriever classes. LlamaIndex is the framework-level abstraction with pluggable extractors and retrievers. LangChain / Neo4j is the chain-builder pattern that composes the same primitives differently. A team picking among them is picking a primary surface (pipeline, database, framework, or chain) more than picking an architectural commitment. The graph is portable; the surfaces are not.

. . .

Hybrid GraphRAG

The empirical question of whether graph retrieval and vector retrieval should compete or compose has been studied directly, and the published evidence so far suggests they should compose. On financial earnings-call transcripts, a hybrid retriever that draws from both a vector database and a knowledge graph outperforms either component alone at both retrieval and generation.¹⁷ The framing matters: hybrid retrieval is not a tuning knob on a vector pipeline or a graph pipeline. It is a separately-measurable third architecture.

A second study on Open Radio Access Network technical specifications confirms the pattern: both GraphRAG and Hybrid GraphRAG outperform traditional RAG, with Hybrid GraphRAG improving factual correctness by 8 percent and GraphRAG improving context relevance by 11 percent, measured against the standard RAG metrics of faithfulness, answer relevance, context relevance, and factual correctness.¹⁸ The point is not that the absolute numbers transfer to other domains; the point is that two independent peer-reviewed studies on substantively different corpora both find that combining graph and vector retrieval improves on vector retrieval alone, and the first finds that the hybrid beats either of its components.

The mechanism that ties graph and vector together is Reciprocal Rank Fusion (RRF), the same rank-fusion technique discussed in the classic-search article for combining BM25 and dense retrieval. In the GraphRAG case, two retrievers produce ranked lists (one from vector similarity over text chunks, one from graph traversal over entities and relations), RRF blends the ranks, and the combined ranking goes downstream. Maintaining separate embeddings for entities, chunks, and relations enables multi-granular matching that a single embedding space cannot do. Reported gains are up to 15 percent and 4.35 percent over vanilla vector retrieval under LLM-as-Judge evaluation.¹⁹ The retrievers do not have to produce comparable scores because RRF only uses rank position.
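
RRF itself is a few lines. Each item's fused score is the sum, over the ranked lists it appears in, of 1 / (k + rank). The sketch assumes both retrievers emit ids in a shared id space (for example, graph hits mapped back to the chunk ids they were extracted from), and it uses k = 60, the constant commonly used in the literature. Neither choice comes from the studies cited above.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank(d)).

    rankings: list of ranked lists of item ids, best first (rank starts at 1).
    k: smoothing constant; 60 is the value commonly used in the literature.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id for a deterministic order.
    return sorted(scores, key=lambda item: (-scores[item], item))
```

An item ranked well by both retrievers outranks one ranked first by only a single retriever, which is the behavior that lets two differently scored retrievers be combined without score calibration.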

The Neo4j GraphRAG library's VectorCypherRetriever class is the productized version of the same pattern, combining vector similarity with graph traversal. The HybridRetriever class goes further, combining a vector index with a Lucene full-text index over the same graph. The library's existence is evidence that the pattern is stable enough to be worth packaging as a default retriever, which in turn is evidence that the field has converged on hybrid as a real third architecture rather than a tuning curiosity.

The strongest practical implication of the hybrid finding is that the choice is not always graph-or-vector. For teams that already have a vector-RAG system running, adding a graph component (over the same corpus, or over a subset of it) is a defensible incremental investment, with a literature that suggests it will pay off on at least some of the workload. The cost calculation still applies: graph indexing is still expensive, and the marginal queries that benefit from the graph have to be frequent enough to justify the index. The choice is not all-or-nothing, and the hybrid pattern lets a team buy the parts of GraphRAG that are most valuable for their workload without committing to the architecture wholesale.

. . .

HippoRAG and LightRAG

Microsoft GraphRAG is the canonical reference, but it is not the only graph-grounded retrieval system worth knowing about. Two systems published in 2024 take the same starting premise (LLM-extracted graph as retrieval substrate) and explore different points in the design space, and both have published numbers that complicate the simple "Microsoft GraphRAG versus vector RAG" comparison.

HippoRAG is the more theoretically grounded of the two. The architecture replaces community summarization with Personalized PageRank over the LLM-extracted graph: instead of pre-generating summaries of regions of the graph, the system runs PageRank biased toward the entities the query mentions and ranks documents by their PageRank-weighted relevance. Reported gains reach up to 20 percent on multi-hop question answering, and single-step retrieval with HippoRAG matches or beats iterative retrieval methods at 10 to 30 times lower cost and 6 to 13 times higher speed.²⁵ The HippoRAG2 numbers in the per-query cost table earlier (~1,000 tokens/query) are more than two orders of magnitude lower than MS-GraphRAG global (~330,000 tokens/query), a ratio of roughly 330 to 1, which is the practical consequence of replacing community-summary aggregation with PageRank-based ranking.
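
Personalized PageRank can be sketched as a power iteration in which the random walk restarts only at the query's entities. This is a generic textbook formulation under assumed defaults (damping 0.85, 50 iterations), not HippoRAG's implementation. The document-scoring helper is likewise a simplified stand-in for ranking passages by the PageRank mass of the entities they mention.

```python
def personalized_pagerank(adjacency, seeds, damping=0.85, iterations=50):
    """Power iteration for PageRank with restarts concentrated on `seeds`.

    adjacency: dict mapping each node to a list of its out-neighbors.
    seeds: nodes the query mentions; the random walk restarts only at these.
    """
    nodes = set(adjacency)
    for targets in adjacency.values():
        nodes.update(targets)
    seed_set = {s for s in seeds if s in nodes}
    if not seed_set:
        raise ValueError("at least one seed must be a node in the graph")
    restart = {n: (1.0 / len(seed_set) if n in seed_set else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * restart[n] for n in nodes}
        for node in nodes:
            targets = adjacency.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: return its mass to the restart distribution.
                for n in nodes:
                    new_rank[n] += damping * rank[node] * restart[n]
        rank = new_rank
    return rank


def rank_documents(rank, doc_entities):
    """Order documents by the summed PageRank mass of the entities they mention."""
    scores = {doc: sum(rank.get(e, 0.0) for e in entities)
              for doc, entities in doc_entities.items()}
    return sorted(scores, key=lambda doc: (-scores[doc], doc))
```

No language-model call appears anywhere in the ranking step, which is where the per-query cost difference against community-summary aggregation comes from.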

LightRAG is the more pragmatic of the two and positions itself directly against Microsoft GraphRAG. The system retains the LLM-extracted graph but replaces community summarization with dual-level retrieval. The low-level mode retrieves specific entities along with their attributes and relationships; the high-level mode aggregates across multiple related entities to address broader topics and overarching themes. The retrieval primitive blends graph traversal with vector embeddings of the graph elements, which is structurally closer to a hybrid retrieval pattern than to community-summary aggregation. The cost numbers reported earlier (610,000 tokens versus fewer than 100 on retrieval, and seamless incremental updates against full reconstruction) are the strongest argument the paper makes: LightRAG preserves the most useful properties of GraphRAG while substantially reducing the per-query and update-time costs.

The two systems are useful as comparison points for a reason beyond the cost numbers: they demonstrate that the design space inside GraphRAG is genuinely open. Microsoft's community-summarization step is one design choice; HippoRAG's Personalized PageRank is another; LightRAG's dual-level entity retrieval is a third. Each makes different assumptions about which queries it expects to serve and which costs it expects to pay. A team adopting a GraphRAG-style architecture is adopting a class of architectures, not a single pipeline.

The honest comparison is that none of the three dominates the others on every workload. Microsoft GraphRAG's global search is the strongest performer on the sensemaking questions it was evaluated on, but it pays the highest per-query cost to get there. HippoRAG is the strongest performer on multi-hop question answering and the cheapest per query, but its retrieval primitive is less suited to the "what are the main themes here?" question. LightRAG occupies the middle: cheaper than Microsoft, more theme-aware than HippoRAG, but without the strongest result on either axis. For a team that does not already know which axis matters most for their workload, the right reading is that the choice is workload-dependent, and the comparative benchmarks in the next section are where the workload-to-architecture mapping starts to become legible.

. . .

Comparative Benchmarks

The literature on when GraphRAG beats vector RAG and when it does not has matured quickly. The most useful framework for making the call is a four-level task taxonomy that progressively scales retrieval difficulty and reasoning complexity.²³ Level 1 is Fact Retrieval (find a specific fact stated in the corpus); Level 2 is Complex Reasoning (compose multiple facts across documents); Level 3 is Contextual Summarize (summarize a region of the corpus); Level 4 is Creative Generation (generate new content informed by the corpus).

Vanilla RAG versus GraphRAG by task complexity.

The findings are unambiguous in two directions. On Level 1, vanilla RAG is comparable to or outperforms GraphRAG on simple fact-retrieval tasks that do not require reasoning across connected concepts. On Levels 2 through 4, GraphRAG shows a clear advantage. The retrieval-stage numbers tell the same story: vanilla RAG hits 83.2 percent Evidence Recall on discrete fact questions, while GraphRAG's advantages emerge as questions grow more complex. The pattern is practitioner-usable: if the workload is dominated by fact-lookup questions, GraphRAG is overengineering, and a vanilla vector pipeline is both cheaper and equally accurate. If the workload is dominated by reasoning-intensive or summarization-intensive questions, GraphRAG is the right architecture, and the indexing investment pays off in per-query accuracy. The mistake to avoid is conflating the two and either over-adopting GraphRAG for easy cases or under-adopting it for hard ones.

A second study supplements the taxonomy with a methodological argument about how benchmarks should be conducted in the first place. Existing GraphRAG evaluations have been tailored to specific tasks, datasets, and system designs, with heterogeneous evaluation protocols making cross-paper comparison unreliable. Under a unified protocol that standardizes preprocessing, retrieval configurations, and generation settings, both architectures show task-specific strengths and neither dominates across the board.²² The analysis also surfaces failure modes, efficiency trade-offs, and evaluation biases that no single GraphRAG paper had previously laid out together.

A third paper goes further, arguing that the published GraphRAG numbers are systematically overstated. The current answer-evaluation framework for GraphRAG has two flaws (unrelated questions and evaluation biases) that lead to inflated performance claims, and under an unbiased framework the gains of three representative GraphRAG methods are much more moderate than reported.²⁴ The argument is not that GraphRAG is bad; the argument is that the field has been measuring GraphRAG's advantage against vector RAG with a methodology that exaggerates the gap. The right reading of the existing benchmarks is therefore directional rather than quantitative: GraphRAG helps on the workloads where the taxonomy says it helps, but the magnitude of improvement is probably less than the original papers reported.

A fourth comparison point comes from an ontology-driven evaluation that places Microsoft GraphRAG against schema-driven graphs built from ontologies and against vector RAG, on the same set of questions. The headline numbers are striking: Microsoft GraphRAG scores 90 percent accuracy (18/20); a graph derived from a relational-database ontology with text chunks also scores 90 percent (18/20); a graph learned from text with chunks scores 90 percent (18/20); vector RAG scores 60 percent (12/20).²⁰ The 30-point gap between graph-based methods and vector RAG is real, but the more interesting finding is that the three graph-based methods are tied. A schema-driven graph built from a relational database can match a learned graph on the same workload, which has implications for cost: where a schema already exists, the LLM-extraction step may not be the cheapest way to build the graph.

The same paper surfaces a second finding worth knowing: graph-based methods without text chunks collapse. Text Ontology (no chunks) scores 15 percent (3/20), and RDB Ontology (no chunks) scores 20 percent (4/20). The graph alone is not enough. The graph plus the source-text chunks the graph points back to is what produces the accuracy gain, which is a structural argument for keeping the source text in the retrieval pipeline rather than treating the graph as a self-contained knowledge representation. Every reference implementation discussed earlier does this in practice, but the empirical magnitude of the dependence on source text has not been quantified elsewhere this cleanly.

. . .

Acknowledged Limitations

The Microsoft GraphRAG team's own statements about the limitations of their system are unusually candid for a vendor-released product, and they constrain the claims the rest of the literature can responsibly make. The relevant sources are the original paper's discussion section and the project's RAI transparency document.

The first limitation is generalization scope. The headline GraphRAG results were demonstrated on a narrow evaluation: two corpora, one workload type (sensemaking questions), one scale (~1M tokens each).⁵ Any extrapolation from those results to other domains is the reader's extrapolation, not the authors'. The subsequent benchmark literature exists in part to do that extrapolation responsibly, and as the previous section showed, the extrapolation is uneven: GraphRAG generalizes well to some workloads (complex reasoning, summarization, creative generation) and poorly to others (simple fact lookup).

The second limitation is extraction-prompt quality. The graph's quality is bounded by the extraction prompt's quality, and the extraction prompt is a per-domain configuration that Microsoft provides defaults for but does not solve for every domain.⁹ For a financial-services deployment the extraction prompt has to know what "credit default swap" or "Tier 1 capital ratio" mean as entity types; the defaults Microsoft ships do not. The configuration cost is real, and it shows up as data-engineering time rather than as runtime cost.

The schema-free versus schema-driven tradeoff sits behind this discussion, and LlamaIndex's PropertyGraphIndex documentation surfaces it as a side-by-side API choice within a single framework. SimpleLLMPathExtractor is schema-free: the language model decides what entity and relationship types appear, with no constraints imposed up front. SchemaLLMPathExtractor is schema-driven: the developer provides a typed schema of expected entity types and relationship types, and the language model is constrained to extract only what the schema allows. Each approach has a clear trade-off. Schema-free extraction adapts to whatever the corpus contains but produces inconsistent types across runs and chunks, which makes the resulting graph harder to query reliably. Schema-driven extraction produces a clean and queryable graph but requires the developer to know the schema up front, which is exactly the work the LLM was supposed to automate. The earlier ontology-comparison finding (that schema-driven graphs match learned graphs on accuracy) is relevant here: a known schema can substitute for the language model's discovery work at the cost of the up-front schema-design effort.
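The difference between the two modes can be shown on already-extracted triples, leaving the language-model call out entirely. The sketch below is not the LlamaIndex API: the `Triple` type, the `SCHEMA` set, and both functions are hypothetical, and the real extractors constrain the model during extraction rather than filtering its output afterwards.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    subject_type: str
    relation: str
    obj: str
    obj_type: str


# Hypothetical schema: the only (subject type, relation, object type)
# combinations the developer has decided the graph may contain.
SCHEMA = {
    ("COMPANY", "ISSUES", "INSTRUMENT"),
    ("COMPANY", "REPORTS", "METRIC"),
}


def schema_free(triples):
    """Keep whatever types the model produced."""
    return list(triples)


def schema_driven(triples, schema=SCHEMA):
    """Keep only triples whose type signature appears in the schema."""
    return [
        t for t in triples
        if (t.subject_type, t.relation, t.obj_type) in schema
    ]
```

The schema-driven path returns a graph in which every edge has a known type signature, which is what makes downstream queries reliable. The cost is that anything the schema did not anticipate is dropped.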

The third limitation is the model-level risk surface.⁹ Any LLM may produce inappropriate or offensive content, and in a GraphRAG pipeline this applies twice: once at extraction time (the entities and relationships the model surfaces can be wrong, biased, or harmful), and again at generation time (the answer the model produces from the retrieved context can be wrong, biased, or harmful). The fact that the graph index itself was generated by a language model means that even a faithful retrieval over the index can surface upstream extraction errors as if they were ground truth.

Two additional limitations the literature has surfaced are worth naming alongside the vendor's own. The first is the corpus-update problem: as the source corpus changes, the graph drifts away from the source, and re-indexing is expensive. The LightRAG comparison cited earlier, in which a Legal-dataset-sized addition forces GraphRAG to reconstruct its community reports while LightRAG updates incrementally, is the most concrete evidence of the magnitude. The second is the entity-resolution problem mentioned earlier: surface-form variants of the same entity that the extraction model does not reconcile to a single node produce a sparser, more fragmented graph than the corpus actually warrants. Both limitations are operationally significant and neither is specific to Microsoft's implementation. Both are properties of the architecture rather than shortcomings of the underlying model that a better model might fix.
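The entity-resolution problem is easiest to see on a handful of surface forms. The sketch below merges mentions by a normalized key. It is a deliberately crude rule-based stand-in for whatever resolution a real pipeline uses, and the suffix list and function names are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical list of corporate suffixes to ignore when comparing mentions.
SUFFIXES = {"inc", "corp", "corporation", "ltd", "llc"}


def canonical_key(mention):
    """Lowercase, strip punctuation, and drop corporate suffixes."""
    tokens = re.sub(r"[^\w\s]", " ", mention.casefold()).split()
    kept = [t for t in tokens if t not in SUFFIXES]
    return " ".join(kept or tokens)


def resolve(mentions):
    """Group surface forms that normalize to the same key."""
    groups = defaultdict(list)
    for mention in mentions:
        groups[canonical_key(mention)].append(mention)
    return dict(groups)
```

Without the merge, "Microsoft Corp.", "microsoft", and "Microsoft Corporation" would be three separate nodes, each holding a third of the edges that belong to one entity.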

. . .

The Operational Concerns

The operational picture of a GraphRAG deployment is different from a vector-RAG deployment in ways that matter for on-call burden, deployment cadence, and engineering staffing. The differences are not all disadvantages, but they are real, and a team adopting the architecture should understand them before the system is in production rather than after.

The first operational fact is the indexing cadence. A vector-RAG index can be updated incrementally one document at a time, because each document's embedding is a self-contained artifact that does not depend on the rest of the corpus. A GraphRAG index, by contrast, has structure that depends on the corpus as a whole: the entity-resolution step needs to know what other entities exist to decide whether a newly extracted mention is a duplicate, and the community-detection step has to be re-run when the graph topology changes. The Microsoft implementation handles incremental updates, but the cost is non-trivial: in the LightRAG comparison cited earlier, adding a Legal-dataset-sized batch of documents forces GraphRAG to reconstruct its community reports, where LightRAG merges the new data incrementally. For a corpus that changes once a quarter, this is manageable. For a corpus that changes hourly, it is not, and the team will spend more time managing the indexing pipeline than building features on top of it.

The second operational fact is the prompt-engineering surface. The extraction prompt is a piece of code, and like any piece of code it has versions, regressions, and downstream consumers. A change to the extraction prompt re-shapes the graph and therefore re-shapes the answers the global-search retriever produces, which means the prompt is part of the system's interface even though it is not exposed as one.⁹ A team that does not version-control its extraction prompts, and that does not regression-test its global-search answers when those prompts change, is shipping a system whose outputs can shift without warning.
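One lightweight way to treat the prompt as a versioned interface is to fingerprint its text and store that fingerprint next to the recorded regression answers. The sketch below assumes only the Python standard library. The function names and the 12-character truncation are illustrative choices, not part of any GraphRAG tooling.

```python
import hashlib


def prompt_fingerprint(prompt_text):
    """Short, stable hash of a prompt, ignoring trailing whitespace."""
    lines = prompt_text.strip().splitlines()
    normalized = "\n".join(line.rstrip() for line in lines)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


def needs_regression_run(current_prompt, recorded_fingerprint):
    """True when the prompt differs from the one the answers were recorded with."""
    return prompt_fingerprint(current_prompt) != recorded_fingerprint
```

A CI step could call `needs_regression_run` with the fingerprint saved alongside the last approved set of global-search answers, and re-run the evaluation whenever it returns `True`.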

The third operational fact is the cost-volatility surface. The per-token cost of the language model used at extraction time is a variable cost that scales with corpus size, and the cost can change underneath the team if the model is hosted (via provider pricing changes) or if the team upgrades the model (via different token economics). A vector-RAG pipeline has a similar exposure on the embedding-model side, but the magnitude is much smaller, since the embedding pass is a single forward pass through a much smaller model. A GraphRAG pipeline that re-indexes a 100M-token corpus quarterly on a frontier-class extraction model is exposed to provider pricing in a way a vector-RAG pipeline is not. The dependency-parsing alternative discussed earlier is essentially an argument for de-risking this exposure by moving the extraction step off the language-model dependency entirely.
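The exposure is simple arithmetic, and writing it down makes the sensitivity to price visible. The sketch below is a rough lower bound under stated assumptions: it counts only input tokens for the extraction pass and ignores output tokens, retries, and community summarization. The price used in the example is a hypothetical placeholder, not a quote from any provider.

```python
def annual_extraction_cost(
    corpus_tokens,
    price_per_million_input,
    reindexes_per_year,
    passes=1.0,
):
    """Input-token cost of re-running extraction over the whole corpus.

    `passes` is how many times each token is read per indexing run.
    """
    millions = corpus_tokens / 1_000_000
    return millions * price_per_million_input * passes * reindexes_per_year


# Hypothetical price of 3.00 per million input tokens, quarterly re-index.
baseline = annual_extraction_cost(100_000_000, 3.00, reindexes_per_year=4)
# The same workload if the provider doubles its price.
after_increase = annual_extraction_cost(100_000_000, 6.00, reindexes_per_year=4)
```

With these placeholder numbers the baseline is 1,200 per year and a doubled price moves it to 2,400. The absolute figures depend entirely on the assumed price; the point is that the cost scales linearly with both corpus size and provider pricing.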

The fourth operational fact is the maintenance burden of a structured representation that drifts. A graph is a structured artifact, and structured artifacts have schemas (even when the schema is implicit), and schemas drift over time as the corpus changes and the extraction prompts evolve. A new entity type that appears in the corpus and is surfaced by the extraction prompt becomes part of the graph; older queries written against the older graph may need to be revised to account for it. A relationship type that becomes deprecated remains in the historical graph until the index is rebuilt. The maintenance burden is small compared to a hand-maintained ontology, but it is larger than for a vector-RAG pipeline, which has no schema to drift.
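Drift of this kind can be detected mechanically by comparing the set of entity or relationship types between two index builds. The following is a minimal sketch. The function name and the idea of exporting a flat list of type labels from each build are assumptions made for illustration.

```python
def schema_drift(old_types, new_types):
    """Report which type labels appeared or disappeared between two builds."""
    old, new = set(old_types), set(new_types)
    return {
        "added": sorted(new - old),
        "removed": sorted(old - new),
    }
```

A non-empty `added` list is a prompt to check whether saved queries need a new case. A non-empty `removed` list flags queries that may now match nothing.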

The Microsoft project itself is unusually transparent about all of this, advising new adopters to start small precisely because the cost shape and the prompt-quality issues will surface fastest on a representative subset of the corpus. A first indexing run at smaller scale is the right way to discover those failure modes before committing to a full corpus index. Skipping the small experiment, then discovering at full scale that the extraction prompts misidentified entity types or that the per-token costs blew the budget, is the most common mistake teams make and the most expensive one to recover from.

. . .

When to Reach for GraphRAG

Pulling the threads of the previous sections together yields a workload-shaped decision framework. The framework is not a checklist, but it is a set of properties of the workload that, taken together, predict whether GraphRAG will earn back its indexing investment. The framework distinguishes three workload shapes: best-fit, marginal-fit, and wrong-fit.

Property	Best fit for GraphRAG	Wrong fit for GraphRAG
Question type	Sensemaking, summarization across the corpus, multi-hop reasoning	Specific fact lookup, single-passage answers
Corpus stability	Updates monthly or less often	Updates hourly or in real time
Entity density	Rich named-entity content with many relationships across documents	Free-form text without strong entity structure
Latency budget	Tolerant of per-query latency in the seconds	Latency-critical, sub-second response required
Indexing budget	Willing to spend significant LLM tokens once at index time	Cost-sensitive at indexing, indifferent to per-query cost
Evaluation discipline	Has a labeled evaluation set and runs the BM25/MTEB-style loops from the measuring-retrieval article	No evaluation infrastructure, decisions made by demo

The best-fit workload combines all six properties in the left column: sensemaking questions over a stable, entity-rich corpus, with a latency budget that tolerates the global-search aggregation cost and an indexing budget that absorbs the LLM calls. This is the workload Microsoft GraphRAG was demonstrated on, and the workload most of the published GraphRAG benchmarks evaluate. For a team whose application looks like this, the architecture is unambiguously the right choice.

The wrong-fit workload combines the properties in the right column: simple fact-lookup questions over a fast-changing corpus with weak entity structure, a sub-second latency budget, and a strict cost-at-indexing constraint. This is the workload where the four-level taxonomy puts vanilla RAG level with or ahead of GraphRAG, and where the indexing-cost economics never amortize. Adopting GraphRAG here is choosing a more expensive system that produces answers that are no better, and sometimes worse. The right answer is a BM25 or vector pipeline, and the BM25 evaluation loop from the measuring-retrieval article is sufficient to verify that the simpler system is meeting the bar.

The marginal-fit workload is the one most real systems actually face, and the right answer there is usually hybrid. It typically looks like a corpus that has some entity structure but is not uniformly entity-rich, a workload that mixes fact-lookup and reasoning-intensive questions, and a latency budget that varies by question type. The hybrid benchmarks (financial transcripts, ORAN specifications) both show that the hybrid architecture is the strongest performer in this middle zone. A team in the middle zone is best served by a system that runs both vector and graph retrieval, blends the rankings with Reciprocal Rank Fusion or similar, and accepts that some queries will be answered primarily by the vector path and some by the graph path.
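Reciprocal Rank Fusion itself is only a few lines: each document's fused score is the sum, over the input rankings, of 1 / (k + rank). The sketch below uses k = 60, a commonly used default that is assumed here rather than taken from the benchmarks above. The function name and the tie-break on document id are illustrative choices.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Ranks are 1-based; a document absent from a list contributes nothing
    for that list.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]
graph_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([vector_hits, graph_hits])
# fused == ["a", "c", "b", "d"]
```

Documents that appear in both lists (`a` and `c`) rise above documents that appear in only one, which is the behavior that lets either path rescue a query the other handles poorly. Because the method uses only ranks, the vector and graph scores never need to be put on a common scale.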

The decision framework should be run with a representative sample of the actual workload, not with a hypothetical one. A team that thinks its workload is sensemaking-heavy may find, on inspection of the query logs, that the majority of queries are actually fact-lookups; a team that thinks its workload is fact-lookup-heavy may find that the cases users actually care about are the reasoning-intensive ones. The most reliable way to know is to label a sample of real queries against the four taxonomy levels (Fact Retrieval, Complex Reasoning, Contextual Summarize, Creative Generation) and count which level dominates. If Level 1 dominates, GraphRAG is overengineering. If Levels 3 or 4 dominate, GraphRAG is genuinely the right architecture. If the distribution is mixed, hybrid is the answer.
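The counting step can be sketched directly. The function below takes one taxonomy level (1 to 4) per labeled query and applies the three rules from the paragraph above. Reading "dominates" as "more than half of the sample" is an assumption, exposed as the `threshold` parameter, and the function name and return labels are hypothetical.

```python
from collections import Counter


def recommend_architecture(levels, threshold=0.5):
    """Map a sample of labeled query levels to a starting architecture.

    levels: iterable of ints in {1, 2, 3, 4}, one per labeled query.
    threshold: share of the sample a group needs in order to "dominate".
    """
    counts = Counter(levels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one labeled query")

    level1_share = counts[1] / total
    level34_share = (counts[3] + counts[4]) / total

    if level1_share > threshold:
        return "vector"      # fact lookup dominates
    if level34_share > threshold:
        return "graphrag"    # summarization or generation dominates
    return "hybrid"          # mixed distribution
```

A sample dominated by Level 2 questions falls through to `hybrid` under these rules, since the paragraph names only Levels 1, 3, and 4 explicitly. A team that sees that distribution may reasonably treat Level 2 as graph-favoring instead, given the taxonomy findings reported earlier.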

. . .

References

Textbook grounding, chapter-level citations, and further reading for each numbered reference in this article live on the companion sources page.

Microsoft. "Welcome to GraphRAG." Official GraphRAG documentation site. Canonical stage list of the indexing pipeline (slice, extract, cluster, summarize) and canonical query-mode taxonomy (Global, Local, DRIFT, Basic).
Peng, B., Zhu, Y., Liu, Y., Bo, X., Shi, H., Hong, C., Zhang, Y., & Tang, S. (2024). "Graph Retrieval-Augmented Generation: A Survey." arXiv:2408.08921. Formalizes the three-stage GraphRAG workflow (graph-based indexing, graph-guided retrieval, graph-enhanced generation).
Han, H., Wang, Y., Shomer, H., Guo, K., Ding, J., Lei, Y., Halappanavar, M., Rossi, R. A., Mukherjee, S., Tang, X., He, Q., Hua, Z., Long, B., Zhao, T., Shah, N., Javari, A., Xia, Y., & Tang, J. (2024). "Retrieval-Augmented Generation with Graphs (GraphRAG)." arXiv:2501.00309. Decomposes GraphRAG into five components: query processor, retriever, organizer, generator, data source.
Zhang, Q., Chen, S., Bei, Y., Yuan, Z., Zhou, H., Hong, Z., Chen, H., Xiao, Y., Zhou, C., Dong, J., Chang, Y., & Huang, X. (2025). "A Survey of Graph Retrieval-Augmented Generation for Customized Large Language Models." arXiv:2501.13958. Names three failure modes of vector RAG that motivate GraphRAG adoption.
Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Truitt, S., Metropolitansky, D., Ness, R. O., & Larson, J. (2024). "From Local to Global: A Graph RAG Approach to Query-Focused Summarization." arXiv:2404.16130. Anchor paper. Defines the global-vs-local distinction, the two-stage indexing pipeline, and reports the 281-minute indexing benchmark.
Larson, J., & Truitt, S. (2024, February 13). "GraphRAG: Unlocking LLM discovery on narrative private data." Microsoft Research Blog. Public launch announcement; contrasts GraphRAG against "baseline RAG" defined as vector-similarity search.
Edge, D., Trinh, H., Truitt, S., & Larson, J. (2024, July 2). "GraphRAG: New tool for complex data discovery now on GitHub." Microsoft Research Blog. Repo-release announcement; frames whole-dataset questions as the place where top-k is the wrong primitive.
Microsoft. "microsoft/graphrag." MIT-licensed reference implementation. Specific files cited: packages/graphrag/graphrag/index/operations/extract_graph/extract_graph.py (entity-and-relationship extraction) and packages/graphrag/graphrag/index/operations/cluster_graph.py (hierarchical Leiden clustering).
Microsoft. "RAI_TRANSPARENCY.md (microsoft/graphrag)." Responsible AI transparency document for the GraphRAG project; enumerates indexing-cost, extraction-prompt-quality, and model-level limitations.
Traag, V. A., Waltman, L., & van Eck, N. J. (2019). "From Louvain to Leiden: guaranteeing well-connected communities." Scientific Reports 9:5233. The algorithm Microsoft GraphRAG uses for community detection; proves connected-community guarantee that Louvain lacks.
Neo4j. "neo4j-graphrag-python documentation." Official Neo4j-maintained Python entry point for building GraphRAG systems on Neo4j.
Neo4j. "User Guide: RAG (neo4j-graphrag-python)." Documents the nine retriever classes (VectorRetriever, VectorCypherRetriever, HybridRetriever, HybridCypherRetriever, ToolsRetriever, Text2CypherRetriever, WeaviateNeo4jRetriever, PineconeNeo4jRetriever, QdrantNeo4jRetriever).
Bratanic, T. (2024, March 15). "Enhancing the Accuracy of RAG Applications With Knowledge Graphs." LangChain blog. Reference pattern for combining graph and vector retrieval in a LangChain / Neo4j pipeline using LLMGraphTransformer.
LlamaIndex. "Property Graph Index Guide (PropertyGraphIndex)." Framework documentation. Documents pluggable extractors (SimpleLLMPathExtractor, ImplicitPathExtractor, DynamicLLMPathExtractor, SchemaLLMPathExtractor) and four parallel sub-retrievers.
LlamaIndex. "KnowledgeGraphIndex API reference (deprecated)." Earlier triplet-based API, deprecated as of version 0.10.53 in favor of PropertyGraphIndex.
Sarmah, B., Hall, B., Rao, R., Patel, S., Pasquali, S., & Mehta, D. (2024). "HybridRAG: Integrating Knowledge Graphs and Vector Retrieval Augmented Generation for Efficient Information Extraction." arXiv:2408.04948. Empirical finding on financial earnings-call transcripts that hybrid retrieval outperforms either component alone.
Ahmad, S., Nezami, Z., Hafeez, M., & Zaidi, S. A. R. (2025). "Benchmarking Vector, Graph and Hybrid Retrieval Augmented Generation (RAG) Pipelines for Open Radio Access Networks (ORAN)." arXiv:2507.03608. Hybrid GraphRAG improves factual correctness by 8% and GraphRAG improves context relevance by 11% over vanilla RAG on ORAN technical specifications.
Min, C., Bansal, S., Pan, J., Keshavarzi, A., Mathew, R., & Kannan, A. V. (2025). "Towards Practical GraphRAG: Efficient Knowledge Graph Construction and Hybrid Retrieval at Scale." arXiv:2507.03226. Dependency-parsing extraction recovers 94% of LLM-extracted-graph quality; hybrid retrieval via Reciprocal Rank Fusion produces 15% and 4.35% improvements over vanilla vector retrieval.
da Cruz, T., Tavares, B., & Belo, F. (2025). "Ontology Learning and Knowledge Graph Construction: A Comparison of Approaches and Their Impact on RAG Performance." arXiv:2511.05991. Head-to-head comparison: GraphRAG, RDB-Ontology with Chunks, and Text-Ontology with Chunks all score 90%; Vector RAG scores 60%; graphs without chunks collapse to 15-20%.
Han, H., Ma, L., Wang, Y., Shomer, H., Lei, Y., Qi, Z., Guo, K., Hua, Z., Long, B., Liu, H., Aggarwal, C. C., & Tang, J. (2025). "RAG vs. GraphRAG: A Systematic Evaluation and Key Insights." arXiv:2502.11371. Unified evaluation protocol; finds task-specific strengths on both sides under fair comparison.
Xiang, Z., Wu, C., Zhang, Q., Chen, S., Hong, Z., Huang, X., & Su, J. (2025). "When to use Graphs in RAG: A Comprehensive Analysis for Graph Retrieval-Augmented Generation." arXiv:2506.05690 (ICLR 2026). Four-level task taxonomy (Fact Retrieval, Complex Reasoning, Contextual Summarize, Creative Generation); per-query token cost comparisons (MS-GraphRAG global at ~331k vs vanilla RAG at ~880 tokens per query).
Zeng, Q., Yan, X., Luo, H., Lin, Y., Wang, Y., Fu, F., Du, B., Xu, Q., & Jiang, J. (2025). "How Significant Are the Real Performance Gains? An Unbiased Evaluation Framework for GraphRAG." arXiv:2506.06331. Methodological critique; under unbiased evaluation, reported GraphRAG gains are "much more moderate than reported previously."
Gutierrez, B. J., Shu, Y., Gu, Y., Yasunaga, M., & Su, Y. (2024). "HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models." NeurIPS 2024. Personalized PageRank over LLM-extracted graph; up to 20% gains on multi-hop QA; 10-30x cheaper than iterative retrieval.
Guo, Z., Xia, L., Yu, Y., Ao, T., & Huang, C. (2024). "LightRAG: Simple and Fast Retrieval-Augmented Generation." EMNLP 2025. Dual-level retrieval (low-level entity detail, high-level theme aggregation); LightRAG uses fewer than 100 tokens per retrieval where GraphRAG uses 610,000; integrates incremental updates without full reconstruction.
