← Back to article

Sources

Grounding, citations, and further reading for GraphRAG: When the Index Is a Graph.

All of this is optional. The article itself is the tutorial. This page exists for readers who want to follow the citation trail back to the primary sources, see the original wording of the published claims, and read deeper into the survey literature.

The numbered references in the article link to the corresponding entries here, so you can jump in at the point of interest and follow the back-to-article link to return.

About the Sources

Edge et al.: From Local to Global (anchor paper)

Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Truitt, S., Metropolitansky, D., Ness, R. O., & Larson, J. (2024). arXiv:2404.16130.

The Microsoft Research paper that defined the modern GraphRAG architecture. Names the global-vs-local query distinction, proposes the two-stage indexing pipeline (entity-graph extraction plus community summarization), and reports the 281-minute indexing benchmark that has framed every cost conversation since. Available at arxiv.org/abs/2404.16130.

Microsoft GraphRAG documentation and repository

Microsoft. Official docs site and MIT-licensed reference implementation.

The canonical implementation. Documentation at microsoft.github.io/graphrag enumerates the indexing pipeline stages and the four query modes (Global, Local, DRIFT, Basic). Repository at github.com/microsoft/graphrag; the entity-extraction and community-clustering modules are cited by file path in the article.

Peng et al.: GraphRAG survey

Peng, B., Zhu, Y., Liu, Y., Bo, X., Shi, H., Hong, C., Zhang, Y., & Tang, S. (2024). arXiv:2408.08921.

Formalizes the three-stage GraphRAG workflow (graph-based indexing, graph-guided retrieval, graph-enhanced generation). Useful as a structural map of the field. Available at arxiv.org/abs/2408.08921.

Han et al.: Retrieval-Augmented Generation with Graphs

Han, H., Wang, Y., Shomer, H., et al. (2024). arXiv:2501.00309.

Decomposes a GraphRAG system into five components: query processor, retriever, organizer, generator, data source. Names the graph itself as the data source rather than as a retriever choice, which sharpens the architectural framing. Available at arxiv.org/abs/2501.00309.

Zhang et al.: customized GraphRAG survey

Zhang, Q., Chen, S., Bei, Y., et al. (2025). arXiv:2501.13958.

Surveys GraphRAG variants for customized LLM applications. Names three specific failure modes of flat text retrieval that motivate the graph alternative: complex query understanding in professional contexts, knowledge integration across distributed sources, and system efficiency at scale. Available at arxiv.org/abs/2501.13958.

Microsoft Research blog posts (2024 launch coverage)

Larson, J., Truitt, S., Edge, D., & Trinh, H. (2024). Microsoft Research Blog.

Two posts accompanying the public launch and GitHub release of GraphRAG. The first (February 2024) frames GraphRAG against baseline RAG defined as vector-similarity search. The second (July 2024) frames whole-dataset questions as the place where top-k retrieval is the wrong primitive. Useful as a vendor-side rhetorical record alongside the academic paper.

Microsoft RAI transparency document

Microsoft. RAI_TRANSPARENCY.md in the GraphRAG repository.

The project's Responsible AI transparency document. Enumerates the indexing-cost concern, the extraction-prompt-quality dependency, and the model-level risk surface that the article calls out as load-bearing operational facts. Linked at github.com/microsoft/graphrag/blob/main/RAI_TRANSPARENCY.md.

Bratanic / LangChain knowledge-graph writeup

Bratanic, T. (2024, March 15). LangChain blog.

Reference pattern for combining graph and vector retrieval in a LangChain / Neo4j pipeline using LLMGraphTransformer. Articulates the motivation for hybrid graph-plus-vector retrieval in plain practitioner terms. Available at blog.langchain.com.

HippoRAG paper

Gutierrez, B. J., Shu, Y., Gu, Y., Yasunaga, M., & Su, Y. (2024). NeurIPS 2024. arXiv:2405.14831.

Replaces community summarization with Personalized PageRank over an LLM-extracted graph. Reports gains of up to 20% on multi-hop question answering and 10 to 30 times cheaper retrieval than iterative methods. Available at arxiv.org/abs/2405.14831.

LightRAG paper

Guo, Z., Xia, L., Yu, Y., Ao, T., & Huang, C. (2024). EMNLP 2025. arXiv:2410.05779.

Dual-level retrieval (low-level entity, high-level theme) plus incremental indexing. Positions explicitly against Microsoft GraphRAG and reports dramatic cost reductions on retrieval and on corpus updates. Available at arxiv.org/abs/2410.05779.

Leiden algorithm paper (Traag, Waltman, van Eck)

Traag, V. A., Waltman, L., & van Eck, N. J. (2019). Scientific Reports 9:5233.

The community-detection algorithm Microsoft GraphRAG uses. Proves that Leiden guarantees connected communities, which is the property that justifies treating each community summary as a coherent thematic region. Available at arxiv.org/abs/1810.08473.

Neo4j GraphRAG Python documentation

Neo4j. Official documentation site.

Official Neo4j-maintained Python entry point for building GraphRAG systems on Neo4j. The user guide documents nine retriever classes, illustrating the pluralism of retrieval primitives that the same graph index can support. Available at neo4j.com/docs/neo4j-graphrag-python/current/.

LlamaIndex PropertyGraphIndex documentation

LlamaIndex. Framework documentation.

Documents pluggable extractors (SimpleLLMPathExtractor, ImplicitPathExtractor, DynamicLLMPathExtractor, SchemaLLMPathExtractor) and four parallel sub-retrievers. The framework-level abstraction that makes the schema-free / schema-driven tradeoff explicit as an API choice. Available at developers.llamaindex.ai.

Comparative-benchmark papers (Xiang, Han, Zeng, da Cruz)

Four 2025 evaluations that contextualize the original Microsoft results.

Xiang et al. (ICLR 2026) introduces the four-level task taxonomy used to predict which workloads favor graph retrieval. Han et al. supplies a unified evaluation protocol. Zeng et al. critiques the evaluation methodology and finds reported GraphRAG gains overstated. Da Cruz et al. compares Microsoft GraphRAG against ontology-driven graphs and against vector RAG. Each is cited individually below.

HybridRAG, ORAN, and Towards Practical GraphRAG

Three papers grounding the hybrid graph-plus-vector pattern.

Sarmah et al. (HybridRAG, 2024) evaluates the hybrid pattern on financial earnings-call transcripts. Ahmad et al. (ORAN, 2025) replicates the finding on telecom specifications. Min et al. (Towards Practical GraphRAG, 2025) proposes dependency-parsing extraction plus Reciprocal Rank Fusion as a lower-cost alternative pipeline.

What "Graph" Means Here

[2] Peng et al. survey definition of GraphRAG

Peng et al. open their survey with a definition that captures the load-bearing intuition behind the architecture: "Graph, by its intrinsic 'nodes connected by edges' nature, encodes massive heterogeneous and relational information, making it a golden resource for RAG in tremendous real-world applications." The survey formalizes the three-stage GraphRAG workflow: graph-based indexing, graph-guided retrieval, and graph-enhanced generation.

Peng et al. (2024), Graph Retrieval-Augmented Generation: A Survey. arXiv:2408.08921

↩ Back to article

[3] Han et al. five-component decomposition

The Han et al. companion survey decomposes a GraphRAG system into five components: query processor, retriever, organizer, generator, and data source. The decomposition names the graph itself as the data source rather than as a retriever choice, which sharpens the structural framing: the graph is upstream of retrieval, not an alternative similarity function bolted onto an otherwise-vector pipeline.

Han et al. (2024), Retrieval-Augmented Generation with Graphs (GraphRAG). arXiv:2501.00309

↩ Back to article

[4] Zhang et al. on the limits of flat text retrieval

The Zhang et al. survey of customized GraphRAG names three specific limitations of flat text retrieval that motivate the graph alternative: "(i) complex query understanding in professional contexts, (ii) difficulties in knowledge integration across distributed sources, and (iii) system efficiency bottlenecks at scale." The third point is the cost lever the indexing pipeline pulls on: pre-computing structure at ingest amortizes work that flat retrieval would otherwise repeat per query. GraphRAG front-loads the LLM calls into ingest, where they amortize across thousands of future queries; vector RAG spreads that same compute across per-query inference, where it never amortizes.

Zhang et al. (2025), A Survey of Graph Retrieval-Augmented Generation for Customized Large Language Models. arXiv:2501.13958

↩ Back to article

[1] Microsoft's "structured, hierarchical" framing

The official Microsoft GraphRAG documentation calls the approach "a structured, hierarchical approach to Retrieval Augmented Generation (RAG), as opposed to naive semantic-search approaches using plain text snippets." The contrast term is "naive semantic-search," meaning vector top-k. The framing is partly marketing, since vector RAG is not naive in any technical sense, but the underlying architectural claim survives the rhetoric: GraphRAG indexes structure, and vector RAG indexes text.

Microsoft, Welcome to GraphRAG. microsoft.github.io/graphrag

↩ Back to article

[6] Microsoft Research launch announcement

The Microsoft Research blog post that accompanied the public launch contrasts GraphRAG against "baseline RAG" defined as vector-similarity search. The post calls out two failure modes: baseline RAG "struggles to connect the dots" when answering "requires traversing disparate pieces of information," and it "performs poorly when being asked to holistically understand summarized semantic concepts over large data collections." Useful as a vendor-side rhetorical record of how the project framed itself at launch.

Larson & Truitt (2024, February 13), GraphRAG: Unlocking LLM discovery on narrative private data. Microsoft Research Blog

↩ Back to article

The GraphRAG Anchor Paper

[5] Edge et al. on the global-query failure mode

The anchor paper for the current GraphRAG conversation is Edge, Trinh, Cheng, Bradley, Chao, Mody, Truitt, Metropolitansky, Ness, and Larson's From Local to Global: A Graph RAG Approach to Query-Focused Summarization, released by Microsoft Research in April 2024 and revised in February 2025.

The paper's opening framing of the failure mode is unusually clean: "RAG fails on global questions directed at an entire text corpus, such as 'What are the main themes in the dataset?', since this is inherently a query-focused summarization (QFS) task, rather than an explicit retrieval task." A top-k vector retrieval cannot answer the question because no single chunk contains the answer; the answer is a property of the corpus considered as a whole.

The two-stage architecture is described directly: "Our approach uses an LLM to build a graph index in two stages: first, to derive an entity knowledge graph from the source documents, then to pregenerate community summaries for all groups of closely related entities."

The query pipeline composes an answer rather than retrieving a chunk: "Given a question, each community summary is used to generate a partial response, before all partial responses are again summarized in a final response to the user."

The cost number that has framed every subsequent conversation is also from this paper: graph indexing with a 600-token window took 281 minutes for the Podcast dataset, and the community-summary stage required 9 to 43 times fewer tokens per query than source-text summarization on the same questions.

The paper's self-assessment of its own limitations is direct: the authors note that their evaluation focused on sensemaking questions specific to two corpora each containing approximately 1 million tokens, and that more work is needed to understand how performance generalizes to datasets from various domains. The strongest GraphRAG result in the literature is therefore defined on two corpora and one workload type, and the authors flag that fact as a generalization risk.

Edge et al. (2024), From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv:2404.16130

↩ Back to article

[1] Microsoft documentation: four query modes

The Microsoft GraphRAG documentation taxonomizes the query side into four modes. Global Search aggregates across community summaries to answer corpus-wide questions. Local Search "reasons about specific entities by fanning-out to their neighbors." DRIFT Search "combines local reasoning with the added context of community information." Basic Search falls back to baseline RAG when the question is best answered that way. Each mode consumes a different slice of the same indexed structure.

Microsoft, Welcome to GraphRAG. microsoft.github.io/graphrag

↩ Back to article

The Indexing Pipeline

[8] Microsoft repository: entity extraction module

In the Microsoft reference implementation, entity extraction is encapsulated by a single module: packages/graphrag/graphrag/index/operations/extract_graph/extract_graph.py contains a function extract_graph() documented as "Extract a graph from a piece of text using a language model," with an inner _run_extract_graph() running "the graph intelligence entity extraction strategy."

The Edge et al. paper describes the operation in plain language: "the LLM is prompted to extract instances of important entities and the relationships between the entities from a given chunk." The strategy itself is a prompt-engineering artifact, with the model asked in structured format to enumerate the entities present in a chunk along with their types and descriptions, then to enumerate the relationships between those entities.

Microsoft, microsoft/graphrag; Edge et al. (2024).

↩ Back to article

[9] RAI document on prompt-dependence

Microsoft's RAI transparency document for the project notes the prompt-engineering dependency directly: "GraphRAG depends on a well-constructed indexing examples. For general applications (e.g. content oriented around people, places, organizations, things, etc.) we provide example indexing prompts. For unique datasets effective indexing can depend on proper identification of domain-specific concepts."

The same document acknowledges the indexing-cost concern: "Indexing is a relatively expensive operation." The project README echoes this with an explicit user-facing warning: "GraphRAG indexing can be an expensive operation, please read all of the documentation to understand the process and costs involved, and start small."

The plain reading: the graph's quality is bounded by the extraction prompt's quality, and that prompt is a per-domain configuration burden that the project provides defaults for but does not solve for every domain.

Microsoft, RAI_TRANSPARENCY.md. github.com/microsoft/graphrag

↩ Back to article

[14] Bratanic / LangChain on LLMGraphTransformer

The Bratanic / LangChain integration writeup makes the same prompt-dependence point from a different vantage. The LLMGraphTransformer "automates" graph construction by analyzing text to "identify entities, understand the relationships between them, and suggest how they might be best represented in a graph structure." The motivation for hybrid graph-plus-vector retrieval is articulated cleanly: "graphs are great at representing and storing heterogeneous and interconnected information in a structured manner, effortlessly capturing complex relationships and attributes across diverse data types," while "vector databases often struggle with such structured information." The hybrid framing: "the best of both worlds" combines "structured graph data with vector search through unstructured text."

Bratanic (2024, March 15). LangChain blog

↩ Back to article

[8] Microsoft repository: cluster_graph module

Community detection is invoked by another single module in the Microsoft reference implementation: packages/graphrag/graphrag/index/operations/cluster_graph.py imports hierarchical_leiden from graphrag.graphs.hierarchical_leiden and exposes a function with the signature cluster_graph(edges: pd.DataFrame, max_cluster_size: int, use_lcc: bool, seed: int | None = None) -> Communities. The output is a partition of the graph's nodes into clusters, computed at multiple levels of granularity.

Microsoft, microsoft/graphrag

↩ Back to article

The Retrieval Mechanism

[7] Microsoft GitHub-release announcement

The Microsoft GitHub-release announcement makes the global-vs-top-k point explicit: "if a question addresses the entire dataset, all input texts should be considered," in contrast to approaches that "only examine the top-k most similar chunks." The global-search retriever does not return chunks; it iterates over the community summaries produced at indexing time, asks the language model to extract from each summary the parts relevant to the user's question, and then asks the language model again to reduce the partial extractions into a single coherent answer.
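The map-then-reduce shape described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the Microsoft implementation: the names global_search, toy_map, and toy_reduce are hypothetical, and the two callables stand in for language-model calls, stubbed here with plain string functions so the example runs without a model.

```python
from typing import Callable, List


def global_search(
    question: str,
    community_summaries: List[str],
    map_fn: Callable[[str, str], str],
    reduce_fn: Callable[[str, List[str]], str],
) -> str:
    """Map each community summary to a partial answer, then reduce them."""
    # Map step: one call per community summary produced at indexing time.
    partials = [map_fn(question, summary) for summary in community_summaries]
    # Drop empty partials so the reduce step only sees relevant material.
    partials = [p for p in partials if p.strip()]
    # Reduce step: one final call over all surviving partial answers.
    return reduce_fn(question, partials)


# Stand-ins for the two language-model calls (assumptions for illustration).
def toy_map(question: str, summary: str) -> str:
    # Keep a summary only if it shares at least one word with the question.
    q_words = set(question.lower().split())
    return summary if q_words & set(summary.lower().split()) else ""


def toy_reduce(question: str, partials: List[str]) -> str:
    # A real reducer would ask the model to synthesize; here we just join.
    return " | ".join(partials)


summaries = [
    "themes of privacy and regulation",
    "quarterly revenue figures",
    "recurring themes of open source tooling",
]
answer = global_search("main themes", summaries, toy_map, toy_reduce)
print(answer)
```

The point of the sketch is the call count: the map step costs one model call per community summary and the reduce step costs one more, which is why per-query token usage grows with the number of communities rather than with a fixed top-k.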

Edge, Trinh, Truitt, & Larson (2024, July 2), GraphRAG: New tool for complex data discovery now on GitHub. Microsoft Research Blog

↩ Back to article

[13] Neo4j: nine retriever classes

The Neo4j GraphRAG Python library treats retrieval as a choice among parallel retrievers rather than as a fixed pipeline. The user guide lists nine retriever classes available out of the box: VectorRetriever, VectorCypherRetriever, HybridRetriever, HybridCypherRetriever, ToolsRetriever, Text2CypherRetriever, WeaviateNeo4jRetriever, PineconeNeo4jRetriever, QdrantNeo4jRetriever.

Each plays a different role. The VectorRetriever "performs a similarity search based on a Neo4j vector index and a query text or vector." The VectorCypherRetriever "fully leverages Neo4j's graph capabilities by combining vector-based similarity searches with graph traversal techniques." The Text2CypherRetriever "first asks an LLM to generate a Cypher query to fetch the exact information required to answer the question from the database."

The retriever-pluralism is informative: the same graph index can serve as a substrate for vector similarity, traversal, hybrid retrieval, or text-to-Cypher generation, and the choice is a per-query (or per-application) decision rather than an architectural commitment.

Neo4j, User Guide: RAG. neo4j.com/docs/neo4j-graphrag-python

↩ Back to article

[15] LlamaIndex PropertyGraphIndex sub-retrievers

LlamaIndex's PropertyGraphIndex makes a similar retrieval-pluralism explicit. The framework documents four parallel sub-retrievers: LLMSynonymRetriever "generates keywords/synonyms to retrieve nodes and connected paths"; VectorContextRetriever "retrieves nodes via vector similarity, then fetches connected paths"; TextToCypherRetriever "generates and executes Cypher queries based on graph schema"; CypherTemplateRetriever "uses templated Cypher queries with LLM-filled parameters."

The framework also exposes pluggable extractors on the indexing side: SimpleLLMPathExtractor (schema-free LLM extraction), ImplicitPathExtractor (no LLM, uses existing node.relationships attributes), DynamicLLMPathExtractor, and SchemaLLMPathExtractor (typed-schema validation). The pluggable extractor design lets a team trade off cost against quality without changing the rest of the pipeline.

LlamaIndex, Property Graph Index Guide. developers.llamaindex.ai

↩ Back to article

Leiden and Why It Matters

[10] Traag, Waltman & van Eck on the Louvain defect

The Microsoft GraphRAG paper states the algorithmic choice plainly: "we use Leiden community detection (Traag et al., 2019) in a hierarchical manner." The Traag, Waltman, and van Eck paper exists specifically to fix a defect in the older Louvain algorithm.

The defect is described in the authors' own words: "we show that this algorithm [Louvain] has a major defect that largely went unnoticed until now: the Louvain algorithm may yield arbitrarily badly connected communities. In the worst case, communities may even be disconnected, especially when running the algorithm iteratively. In our experimental analysis, we observe that up to 25% of the communities are badly connected and up to 16% are disconnected."

A disconnected community is a community in the partition that contains nodes with no path between them inside the community, even if a path exists through nodes outside it. For most graph applications this is cosmetic; for GraphRAG it is substantive, because a disconnected community means the summary the language model writes for it bundles unrelated entities as if they were a coherent thematic region.
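The definition of a disconnected community can be checked mechanically. The sketch below is a minimal illustration in plain Python, not code from Leiden, Louvain, or the GraphRAG repository; the function name is_connected_within and the toy graph are assumptions made for the example. It runs a breadth-first search that is only allowed to step through nodes inside the community.

```python
from collections import deque
from typing import Dict, Set


def is_connected_within(community: Set[str], adjacency: Dict[str, Set[str]]) -> bool:
    """True if every node in `community` can reach every other node
    using only edges whose endpoints are both inside `community`."""
    if not community:
        return True
    start = next(iter(community))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, set()):
            # Ignore edges that leave the community.
            if neighbor in community and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen == community


# Toy undirected path graph a - b - c - d, stored as a symmetric adjacency map.
adjacency = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c"},
}

print(is_connected_within({"a", "b"}, adjacency))  # a and b share an edge
print(is_connected_within({"a", "d"}, adjacency))  # linked only via b and c
```

In the second call the two nodes are linked in the full graph, but only through nodes outside the community, so a summary written for that community would describe two entities with no direct relationship between them.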

The Leiden algorithm fixes the defect by construction. The paper proves that "the Leiden algorithm yields communities that are guaranteed to be connected," and that "when the Leiden algorithm is applied iteratively, it converges to a partition in which all subsets of all communities are locally optimally assigned." Leiden also runs faster than Louvain in practice, "by relying on a fast local move approach."

Traag, Waltman, & van Eck (2019), From Louvain to Leiden. arXiv:1810.08473

↩ Back to article

The Indexing-Cost Economics

[19] Min et al. on practical extraction costs

The Min et al. Towards Practical GraphRAG paper is the clearest statement in the literature of indexing cost as a problem worth solving rather than a fact to be accepted. The authors write that GraphRAG's "adoption has been limited due to reliance on expensive large language model (LLM)-based extraction and complex traversal strategies."

The proposed mitigation is structural: replace the LLM-based extraction step with dependency parsing, recovering most of the quality at a fraction of the cost. The reported numbers are specific: the dependency-parsing pipeline "achieves 94% of LLM-based performance (61.87% vs. 65.83%) while significantly reducing costs." The headline finding is not that LLM extraction is wrong but that it is over-engineered for many practical settings.

The paper also introduces a hybrid retrieval strategy that fuses vector similarity with graph traversal using Reciprocal Rank Fusion, maintaining separate embeddings for entities, chunks, and relations to enable multi-granular matching. It reports improvements of up to 15% and 4.35% over vanilla vector retrieval baselines under LLM-as-Judge evaluation.
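Reciprocal Rank Fusion itself is simple enough to show directly. The sketch below is a generic version of the technique, not the pipeline from Min et al.: the function name, the chunk identifiers, and the two input rankings are hypothetical, and k = 60 is a commonly used default rather than a value taken from the paper. Each item scores the sum of 1 / (k + rank) over every ranked list it appears in.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: an item's score is sum(1 / (k + rank)) over the lists containing it."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by item id for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical outputs of a vector retriever and a graph-traversal retriever.
vector_hits = ["chunk_3", "chunk_1", "chunk_7"]
graph_hits = ["chunk_1", "chunk_9", "chunk_3"]

fused = reciprocal_rank_fusion([vector_hits, graph_hits])
print(fused)
```

Because the fusion uses ranks rather than raw scores, it needs no calibration between a cosine-similarity scale and a graph-traversal scale, which is one reason it is a common choice for combining heterogeneous retrievers.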

Min et al. (2025), Towards Practical GraphRAG. arXiv:2507.03226

↩ Back to article

[23] Xiang et al. per-query token cost table

The Xiang et al. When to use Graphs in RAG paper provides the most useful published per-query cost numbers, comparing four systems on the same evaluation corpora. Vanilla RAG sits at roughly 880 to 950 tokens per query on Novel and Medical corpora; HippoRAG2 at roughly 1,000; LightRAG at roughly 100,000; MS-GraphRAG (global) at roughly 330,000.

The paper notes that on harder questions the global-search prompt can reach "up to 4x10^4 tokens" of prompt size, with the total token usage compounding because the map-reduce shape calls the language model once per community summary plus once for the final reduction.

The introduction frames the research question directly: "recent studies report that GraphRAG frequently underperforms vanilla RAG on many real-world tasks. This raises a critical question: Is GraphRAG really effective, and in which scenarios do graph structures provide measurable benefits for RAG systems?"

The four-level task taxonomy structures the answer: Level 1 Fact Retrieval, Level 2 Complex Reasoning, Level 3 Contextual Summarize, Level 4 Creative Generation. On Level 1, "Basic RAG is comparable to or outperforms GraphRAG in simple fact retrieval tasks that does not require complex reasoning across connected concepts." On Levels 2 through 4, "GraphRAG models show a clear advantage in complex reasoning, Contextual Summarize, and creative generation." The retrieval-stage numbers reinforce the split: "RAG excels at retrieving discrete facts for simple questions, achieving 83.2% Evidence Recall," while "GraphRAG's advantages emerge clearly as questions grow more complex."

Xiang et al. (2025, ICLR 2026), When to use Graphs in RAG. arXiv:2506.05690

↩ Back to article

[26] LightRAG on incremental updates

The LightRAG paper makes both the retrieval cost and the corpus-update cost concrete. For retrieval, the authors report that "GraphRAG required 610,000 tokens and hundreds of API calls" while "LightRAG consumed fewer than 100 tokens requiring only a single API call for the entire retrieval process."

For incremental updates the gap is just as dramatic: "when adding new data equivalent to the Legal dataset's size, GraphRAG required approximately 1,399 x 2 x 5,000 tokens for complete reconstruction, while LightRAG seamlessly integrated new entities and relationships without full reconstruction."
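Multiplying out the quoted product makes the scale easier to see. The snippet below only performs the arithmetic on the three factors exactly as quoted; it does not assign a meaning to each factor, since the quoted passage does not define them, and the variable names are placeholders.

```python
import math

# The three factors exactly as quoted from the LightRAG paper.
factors = (1_399, 2, 5_000)

# Approximate token cost of a full GraphRAG rebuild in the quoted scenario.
rebuild_tokens = math.prod(factors)
print(f"{rebuild_tokens:,} tokens")
```

The product is roughly 14 million tokens for a single corpus update of that size.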

The system retains the LLM-extracted graph but replaces community summarization with dual-level retrieval: a low-level mode "primarily focused on retrieving specific entities along with their associated attributes or relationships" and a high-level mode that "addresses broader topics and overarching themes." The retrieval primitive blends graph traversal with vector embeddings of the graph elements, which is structurally closer to a hybrid pattern than to Microsoft's community-summary aggregation.

Guo et al. (2024, EMNLP 2025), LightRAG. arXiv:2410.05779

↩ Back to article

The Implementation Ecosystem

[12] Neo4j GraphRAG package positioning

The Neo4j GraphRAG Python package documentation positions the library as "the official Neo4j GraphRAG features for Python," framed as "a first party package to developers, where Neo4j can guarantee long term commitment and maintenance." The Neo4j approach differs from Microsoft's in an important way: it does not commit to one indexing pipeline or one retrieval pattern, but exposes a library of retriever classes and lets the application choose how to compose them.

Neo4j, neo4j-graphrag-python documentation. neo4j.com/docs/neo4j-graphrag-python

↩ Back to article

[16] LlamaIndex KnowledgeGraphIndex deprecation

The earlier LlamaIndex KnowledgeGraphIndex API is now deprecated. The API reference notes the deprecation directly: "the KnowledgeGraphIndex class has been deprecated. Please use the new PropertyGraphIndex class instead," with the deprecation marker placed at version 0.10.53.

The framework's older API was triplet-based: it built a knowledge graph by extracting triplets and used that graph at query time. The move to PropertyGraphIndex reflects a broader pattern in the field: a flat (subject, predicate, object) triplet store is less expressive than a property graph with typed nodes and edges, and the latter has become the default representation. The deprecation is informative as a historical signal: the field's earlier generation of GraphRAG-adjacent systems (pre-2024) used flat triplets, and the current generation uses property graphs.

LlamaIndex, KnowledgeGraphIndex API reference (deprecated). developers.llamaindex.ai

↩ Back to article

Hybrid GraphRAG

[17] Sarmah et al. on financial earnings calls

The Sarmah et al. HybridRAG paper, evaluated on financial earnings-call transcripts, reports that "HybridRAG which retrieves context from both vector database and KG outperforms both traditional VectorRAG and GraphRAG individually when evaluated at both the retrieval and generation stages." The framing matters: hybrid retrieval is not a tuning knob on a vector pipeline or a graph pipeline. It is a separately measurable third architecture, and on the workload the paper evaluated it is the best of the three.

Sarmah et al. (2024), HybridRAG. arXiv:2408.04948

↩ Back to article

[18] Ahmad et al. on ORAN specifications

The Ahmad et al. ORAN paper extends the hybrid finding to a second domain. Evaluating on Open Radio Access Network technical specifications, the authors report that "both GraphRAG and Hybrid GraphRAG outperform traditional RAG. Hybrid GraphRAG improves factual correctness by 8%, while GraphRAG improves context relevance by 11%." The figures are measured against an explicit set of generation metrics (faithfulness, answer relevance, context relevance, factual correctness) drawn from the established RAG-evaluation literature. Two independent peer-reviewed studies on substantively different corpora both find that the hybrid architecture beats either of its components alone.

Ahmad et al. (2025), Benchmarking Vector, Graph and Hybrid Retrieval Augmented Generation Pipelines for Open Radio Access Networks. arXiv:2507.03608

↩ Back to article

HippoRAG and LightRAG

[25] HippoRAG's PageRank-based retrieval

HippoRAG opens with a neurobiological analogy: the system "synergistically orchestrates LLMs, knowledge graphs, and the Personalized PageRank algorithm to mimic the different roles of neocortex and hippocampus in human memory." The architecture replaces community summarization with Personalized PageRank over the LLM-extracted graph: instead of pre-generating summaries of regions, the system runs PageRank biased toward the entities the query mentions and ranks documents by PageRank-weighted relevance.
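The query-biased PageRank idea can be illustrated with a small power-iteration sketch. This is not HippoRAG's code: the function names, the toy entity graph, the document-to-entity mapping, and the choice to score a document by summing the scores of the entities it mentions are all simplifying assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple


def personalized_pagerank(
    adjacency: Dict[str, Set[str]],
    seeds: Set[str],
    damping: float = 0.85,
    iterations: int = 50,
) -> Dict[str, float]:
    """Power-iteration PageRank whose restart mass returns to `seeds` only."""
    if not seeds or not seeds <= set(adjacency):
        raise ValueError("seeds must be a non-empty subset of the graph's nodes")
    nodes = sorted(adjacency)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        # Restart step: a (1 - damping) share of mass jumps back to the seeds.
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for node in nodes:
            neighbors = adjacency[node]
            if not neighbors:
                # Dangling node: return its mass to the seeds.
                for n in nodes:
                    nxt[n] += damping * rank[node] * restart[n]
                continue
            share = damping * rank[node] / len(neighbors)
            for neighbor in neighbors:
                nxt[neighbor] += share
        rank = nxt
    return rank


def rank_documents(
    entity_scores: Dict[str, float], doc_entities: Dict[str, List[str]]
) -> List[Tuple[str, float]]:
    """Score each document by the summed PageRank of the entities it mentions."""
    doc_scores = {
        doc: sum(entity_scores.get(e, 0.0) for e in ents)
        for doc, ents in doc_entities.items()
    }
    return sorted(doc_scores.items(), key=lambda kv: -kv[1])


# Toy undirected entity graph (symmetric adjacency), purely illustrative.
graph = {
    "marie_curie": {"radium", "sorbonne"},
    "radium": {"marie_curie", "polonium"},
    "polonium": {"radium"},
    "sorbonne": {"marie_curie", "paris"},
    "paris": {"sorbonne", "eiffel_tower"},
    "eiffel_tower": {"paris"},
}
docs = {
    "doc_radioactivity": ["radium", "polonium"],
    "doc_tourism": ["paris", "eiffel_tower"],
}

# The query mentions one entity, so all restart mass goes to that node.
scores = personalized_pagerank(graph, seeds={"marie_curie"})
ranked = rank_documents(scores, docs)
print(ranked[0][0])
```

Neither document mentions the query entity directly, yet the one whose entities sit closer to it in the graph ranks first, which is the multi-hop behavior the paper is after.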

The reported results are striking. The paper reports gains of "up to 20%" on multi-hop question answering, and reports that "single-step retrieval with HippoRAG achieves comparable or better performance than iterative retrieval like IRCoT while being 10-30 times cheaper and 6-13 times faster."

Gutierrez et al. (2024, NeurIPS), HippoRAG. arXiv:2405.14831

↩ Back to article

Comparative Benchmarks

[22] Han et al. systematic evaluation

The Han et al. RAG vs. GraphRAG: A Systematic Evaluation and Key Insights paper supplements Xiang et al. with a methodological argument. The authors observe that "existing GraphRAG systems for text data are often tailored to specific tasks, datasets, and system designs, resulting in heterogeneous evaluation protocols. Consequently, a systematic understanding of the relative strengths, limitations, and trade-offs between RAG and GraphRAG on widely used text benchmarks remains limited."

The contribution is a unified evaluation protocol that "standardizes data preprocessing, retrieval configurations, and generation settings, enabling fair and reproducible comparisons." The headline result is that under the unified protocol, both architectures have task-specific strengths, and neither dominates across the board. The paper also analyzes failure modes, efficiency trade-offs, and evaluation biases, producing a more nuanced picture than any single GraphRAG paper had previously surfaced.

Han et al. (2025), RAG vs. GraphRAG. arXiv:2502.11371

↩ Back to article

[24] Zeng et al. on overstated gains

The Zeng et al. How Significant Are the Real Performance Gains? paper argues that the published GraphRAG numbers are systematically overstated. The authors observe that "the current answer evaluation framework for GraphRAG has two critical flaws, i.e., unrelated questions and evaluation biases, which may lead to biased or even wrong conclusions on performance."

The proposed unbiased framework controls for both flaws, and the finding is unflattering: applied to three representative GraphRAG methods, "their performance gains are much more moderate than reported previously." The paper does not argue that GraphRAG is bad; it argues that the field has been measuring GraphRAG's advantage against vector RAG with a methodology that exaggerates the gap. The right reading of the existing benchmarks is therefore directional rather than quantitative.

Zeng et al. (2025), How Significant Are the Real Performance Gains? arXiv:2506.06331

↩ Back to article

[20] Da Cruz et al. ontology comparison

The da Cruz, Tavares, and Belo paper compares vector RAG, Microsoft GraphRAG, and schema-driven graphs built from ontologies (derived from relational databases or learned from text corpora), evaluating each on the same set of questions. The headline numbers: GraphRAG 90% (18/20); RDB-Ontology with Chunks 90% (18/20); Text-Ontology with Chunks 90% (18/20); Vector RAG 60% (12/20).

The 30-point gap between graph-based methods and vector RAG is real, but the more interesting finding is that the three graph-based methods are tied. The paper concludes that "ontology-guided KGs incorporating chunk information achieve competitive performance with state-of-the-art frameworks, substantially outperforming vector retrieval baselines." A schema-driven graph built from a relational database can match a learned graph on the same workload, which has implications for cost.

The paper surfaces a second finding worth knowing: graph-based methods without text chunks collapse. Text Ontology (no chunks) scores 15% (3/20) and RDB Ontology (no chunks) scores 20% (4/20). The graph alone is not enough; the graph plus the source-text chunks the graph points back to is what produces the accuracy gain.
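The percentages and the 30-point gap follow directly from the reported counts. The short check below recomputes them from the correct-out-of-20 figures quoted above; the dictionary keys are labels chosen for this example.

```python
# (correct, total) counts as reported above for each method.
results = {
    "GraphRAG": (18, 20),
    "RDB-Ontology with Chunks": (18, 20),
    "Text-Ontology with Chunks": (18, 20),
    "Vector RAG": (12, 20),
    "Text Ontology (no chunks)": (3, 20),
    "RDB Ontology (no chunks)": (4, 20),
}

accuracy = {name: 100 * correct / total for name, (correct, total) in results.items()}

# Gap between the graph-with-chunks methods and the vector baseline, in points.
gap_points = accuracy["GraphRAG"] - accuracy["Vector RAG"]

for name, pct in accuracy.items():
    print(f"{name}: {pct:.0f}%")
print(f"gap: {gap_points:.0f} points")
```

With only 20 questions, each question is worth 5 percentage points, so the three-way tie at 90% and the collapse without chunks are both easier to read as counts than as percentages.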

da Cruz, Tavares, & Belo (2025), Ontology Learning and Knowledge Graph Construction. arXiv:2511.05991

↩ Back to article

Acknowledged Limitations

[5] Edge et al. on generalization scope

The generalization limitation in the Edge et al. paper is stated precisely: "Our evaluation to date has focused on sensemaking questions specific to two corpora each containing approximately 1 million tokens. More work is needed to understand how performance generalizes to datasets from various domains."

The implication is that the headline GraphRAG results were demonstrated on a narrow evaluation: two corpora, one workload type (sensemaking questions), one scale (~1M tokens each). Any extrapolation from those results to other domains is the reader's extrapolation, not the authors'. The subsequent Han, Xiang, and Zeng papers exist in part to do that extrapolation responsibly.

Edge et al. (2024). arXiv:2404.16130

↩ Back to article

[9] RAI document on model-level risk

The RAI document names a third limitation alongside indexing cost and prompt-quality: the underlying language model "may produce inappropriate or offensive content, which may make it inappropriate to deploy for sensitive contexts without additional mitigations."

This is a generic LLM-output limitation rather than a GraphRAG-specific one, but it applies twice in a GraphRAG pipeline: once at extraction time (the entities and relationships the model surfaces can be wrong, biased, or harmful), and again at generation time (the answer the model produces from retrieved context can be wrong, biased, or harmful). The fact that the graph index itself was generated by a language model means that even a faithful retrieval over the index can surface upstream extraction errors as if they were ground truth.

Microsoft, RAI_TRANSPARENCY.md. github.com/microsoft/graphrag

↩ Back to article