
Sources

Grounding, citations, and further reading for Ontology-Driven Parsing for Retrieval.

All of this is optional. The article itself is the tutorial. This page exists for readers who want to trace each numbered reference back to its primary source, follow the W3C specifications, and read deeper into the schema-versus-extraction tradeoffs in production GraphRAG.

The numbered references in the article hyperlink to the entries below; each entry carries a back-to-article link so you can resume reading where you left off.

About the Sources

Gruber: A Translation Approach to Portable Ontology Specifications

Gruber, T. R. (1993). Knowledge Acquisition.

The definitional anchor for the field. Defines an ontology as "an explicit specification of a conceptualization," the phrase every subsequent textbook on knowledge representation references. The translation-approach framing also seeds the schema-versus-runtime-representation distinction the article rests on. Available at tomgruber.org.

W3C: RDF 1.1 Concepts and Abstract Syntax

W3C Recommendation (2014).

The canonical specification for the RDF data model. Defines the triple as the atomic unit and the abstract syntax that all RDF serializations (Turtle, N-Triples, RDF/XML, JSON-LD) target. Available at w3.org/TR/rdf11-concepts.

W3C: OWL 2 Web Ontology Language

W3C Recommendation (2012).

The canonical specification for OWL 2. The entry point for cardinality constraints, transitive and inverse properties, equivalence and disjointness, and the other expressive constructs that make ontologies queryable with confidence. The OWL 2 Direct Semantics is what description-logic reasoners implement. Available at w3.org/TR/owl2-overview.

W3C: SKOS Simple Knowledge Organization System

W3C Recommendation (2009).

The canonical specification for SKOS. Designed for thesauri and controlled vocabularies rather than logical reasoning, and the right tool for the vocabulary-curation side of an ontology when the schema does not need OWL-grade expressiveness. Standard constructs: broader, narrower, related, prefLabel, altLabel. Available at w3.org/TR/skos-reference.

Guha, Brickley & Macbeth: Schema.org

Guha, R. V., Brickley, D., & Macbeth, S. (2016). Communications of the ACM.

The reference history for the schema.org effort. Documents the pragmatic principles that drove its design (web-scale adoption, search-engine markup, deliberate under-specification) and the trade-offs that came with optimizing for breadth over depth. Useful background for understanding why schema.org is a reasonable general reference but rarely sufficient for domain work. Available at cacm.acm.org.

Edge et al.: Microsoft GraphRAG

Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Truitt, S., & Larson, J. (2024). arXiv:2404.16130.

The Microsoft GraphRAG anchor paper. Introduces the LLM-extracted-graph pattern and the global-versus-local query distinction that motivates much of the typed-extraction discipline. Useful context for why production GraphRAG systems pair an LLM extractor with a curated ontology rather than relying on either alone. Available at arxiv.org/abs/2404.16130.

Gutierrez et al.: HippoRAG

Gutierrez, B. J., Shu, Y., Gu, Y., Yasunaga, M., & Su, Y. (2024). NeurIPS 2024.

A competing system that pairs LLM-driven knowledge graph construction with Personalized PageRank traversal. Instructive as a comparison to the Microsoft community-summarization approach: a similar upstream extraction step, a different retrieval primitive. Available at arxiv.org/abs/2405.14831.

What an Ontology Actually Is

1. Gruber's definitional anchor

The Gruber 1993 paper is the source of the working definition of an ontology that the field still uses: an explicit specification of a conceptualization. The phrase has two load-bearing parts. "Explicit specification" means the schema is written down in a form a machine can read, not held informally in the heads of domain experts. "Conceptualization" means the schema captures an abstract model of the domain, not the surface text of any particular document.

The translation-approach framing in the paper is also worth knowing. Gruber treats an ontology as a contract between knowledge sources written in different formalisms, with the ontology serving as the shared vocabulary that lets them exchange data. This is exactly the role an ontology plays in a modern GraphRAG pipeline: it is the contract between the LLM extractor, the storage layer, and the query layer.

Gruber (1993), A Translation Approach to Portable Ontology Specifications. tomgruber.org

↩ Back to article

The Standards Landscape

2. RDF as the triple-based data model

RDF 1.1 is the W3C Recommendation that defines the triple as the atomic unit of knowledge representation on the semantic web. Everything in an RDF graph is a triple of (subject, predicate, object), and every other semantic-web standard sits on top of this substrate. The abstract syntax in the specification is what all serialization formats (Turtle, N-Triples, RDF/XML, JSON-LD) materialize.

The article's claim that "the discipline is in the schema, not in the serialization format" reflects a practical reading of RDF: production GraphRAG systems rarely exchange data in formal RDF, but the conceptual model of typed triples flows from this specification.
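
As a minimal sketch of what "typed triples without formal RDF" can look like, the snippet below stores triples as plain Python tuples and answers a pattern query over them. The names used here (`Triple`, `match`, and the `ex:` identifiers) are hypothetical illustrations, not part of the RDF specification or of any code in the article.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    """One (subject, predicate, object) statement; field names are illustrative."""
    subject: str
    predicate: str
    obj: str


# A tiny graph: every fact is a triple, including the type assertions.
GRAPH = [
    Triple("ex:Well_42", "rdf:type", "ex:Well"),
    Triple("ex:Acme", "rdf:type", "ex:Operator"),
    Triple("ex:Well_42", "ex:operatedBy", "ex:Acme"),
]


def match(graph, s: Optional[str] = None, p: Optional[str] = None,
          o: Optional[str] = None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in graph
        if (s is None or t.subject == s)
        and (p is None or t.predicate == p)
        and (o is None or t.obj == o)
    ]


# Who operates Well_42?
operators = [t.obj for t in match(GRAPH, s="ex:Well_42", p="ex:operatedBy")]
print(operators)
```

The storage here is an in-memory list, but the same subject, predicate, object shape carries over to a property graph or a relational table; the serialization is incidental.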

W3C (2014), RDF 1.1 Concepts and Abstract Syntax. w3.org/TR/rdf11-concepts

↩ Back to article

3. OWL 2 as the expressive constraint layer

OWL 2 is the W3C Recommendation that defines the expressive constraints layered on top of RDF: cardinality (a Well has exactly one Operator), transitivity (if A is part of B and B is part of C, then A is part of C), equivalence and disjointness, inverse properties, and class expressions that go beyond simple subclassing. The Direct Semantics is what description-logic reasoners implement, and it is what enables the inference and consistency-checking queries the article describes.

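The OWL 2 profiles (EL, QL, RL) trade expressiveness for computational tractability. Most production ontologies pick the smallest profile that supports the constraints they actually need: reasoning over full OWL 2 DL is decidable but expensive, and OWL 2 Full is undecidable. The constraint examples in the article (a Well has exactly one current Operator, a Formation is located in exactly one Basin) fit OWL 2 RL only in part. The "at most one" half can be stated as a functional property or a max-cardinality-1 restriction, but RL does not permit the existential "at least one" half in superclass position, so the full "exactly one" constraint needs OWL 2 DL or a separate closed-world validation check. RL is the profile designed for rule-based implementation, which is why rule-based triple stores commonly support it.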
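
The transitivity and cardinality constraints described above can be sketched in plain Python without a reasoner. This is an illustration under stated assumptions, not OWL semantics: the function names and sample data are hypothetical, and the cardinality check treats different names as different individuals (a unique-name assumption that OWL itself does not make).

```python
def transitive_closure(pairs):
    """Infer every (a, c) implied by a transitive property such as partOf."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def functional_violations(pairs):
    """Return subjects that carry more than one distinct value ("at most one")."""
    values = {}
    for subject, value in pairs:
        values.setdefault(subject, set()).add(value)
    return {s: sorted(v) for s, v in values.items() if len(v) > 1}


# Hypothetical data: partOf is declared transitive, operatedBy functional.
part_of = {("Perforation_7", "Well_42"), ("Well_42", "Field_A")}
inferred = transitive_closure(part_of)

operated_by = [("Well_42", "Acme"), ("Well_42", "Borealis"), ("Well_9", "Acme")]
violations = functional_violations(operated_by)

print(sorted(inferred))
print(violations)
```

A real reasoner would not report the two operators as an error by default; it would infer that `Acme` and `Borealis` name the same individual unless they are declared different. Treating the clash as a violation is the closed-world reading most extraction pipelines want.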

W3C (2012), OWL 2 Web Ontology Language Document Overview. w3.org/TR/owl2-overview

↩ Back to article

4. SKOS for controlled vocabularies

SKOS is the W3C Recommendation designed for thesauri, glossaries, and controlled vocabularies rather than for logical reasoning. The constructs (broader, narrower, related, prefLabel, altLabel) capture vocabulary structure without committing to OWL-grade semantics, which makes SKOS the right tool when the ontology's job is to name things consistently rather than to reason over them.

The synonym-table discipline the article describes (ExxonMobil, Exxon Mobil Corporation, XOM, and Exxon all resolving to the same Operator instance) maps directly onto SKOS prefLabel and altLabel. Most production ontologies use SKOS for the vocabulary layer and OWL for the schema layer, with RDF as the common substrate.
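
The synonym-table mapping can be sketched as a lookup from every label to one concept identifier. The concept ID `operator:exxonmobil`, the dictionary layout, and the function names are assumptions made for illustration; only the `prefLabel` and `altLabel` terms come from SKOS.

```python
# Hypothetical vocabulary: one concept with a preferred label and alternates.
CONCEPTS = {
    "operator:exxonmobil": {
        "prefLabel": "ExxonMobil",
        "altLabels": ["Exxon Mobil Corporation", "XOM", "Exxon"],
    },
}


def build_label_index(concepts):
    """Map every case-folded label (preferred or alternate) to its concept ID."""
    index = {}
    for concept_id, labels in concepts.items():
        for label in [labels["prefLabel"], *labels["altLabels"]]:
            index[label.casefold()] = concept_id
    return index


def resolve(surface_form, index):
    """Return the concept ID for a surface form, or None if it is unknown."""
    return index.get(surface_form.strip().casefold())


LABEL_INDEX = build_label_index(CONCEPTS)

for mention in ["ExxonMobil", "Exxon Mobil Corporation", "xom", " Exxon "]:
    print(mention, "->", resolve(mention, LABEL_INDEX))
```

Exact case-insensitive matching is the simplest possible policy; fuzzy matching or disambiguation of labels shared by several concepts would need more than this sketch provides.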

W3C (2009), SKOS Simple Knowledge Organization System Reference. w3.org/TR/skos-reference

↩ Back to article

5. Schema.org and the runtime-representation question

Guha, Brickley, and Macbeth's Communications of the ACM article is the reference history for the schema.org effort. The paper documents the pragmatic principles that drove the design: web-scale adoption was prioritized over depth of formal modeling, the type vocabulary was kept deliberately shallow to encourage uptake by non-specialists, and search engines (Google, Microsoft, Yahoo) were the primary consumers of the markup.

The relevance to GraphRAG is the trade-off schema.org documents. A general-purpose ontology with shallow types reaches wide adoption but loses domain precision. A domain ontology with deep types serves a narrow community well but does not transfer. The article's claim that schema.org is a useful reference but rarely sufficient alone for domain work reflects this trade-off, and the same logic explains why most production GraphRAG systems do not use formal RDF or SPARQL: the discipline is in the schema design, not in the serialization format.

Guha, Brickley & Macbeth (2016), Schema.org: Evolution of Structured Data on the Web. Communications of the ACM

↩ Back to article

When Ontology-Driven Parsing Earns Its Place

6. Microsoft GraphRAG and the LLM-extractor pattern

The Edge et al. Microsoft GraphRAG paper is a widely cited reference implementation of the LLM-extracted-graph pattern. The paper's indexing pipeline runs an LLM per chunk to extract entities and relationships, then merges the extracted fragments into a corpus-level graph and runs hierarchical Leiden community detection to partition it.

The relevance to the article's argument is that Microsoft's reference implementation pairs an LLM extractor with prompt-level type constraints. The extraction prompt enumerates the entity types the model should use, while relationships are recorded as free-text descriptions rather than drawn from a fixed type list. Enumerating types in the prompt is the role an ontology plays in a typed extraction pipeline, and a full ontology extends the same discipline to relationship types. The graph's quality is bounded by the prompt's quality, which is the operational fact the article's "the ontology is what makes the LLM's extractions reproducible" framing rests on.
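
A minimal sketch of prompt-level type constraints follows, assuming a hypothetical entity-type list and a hypothetical `name | type` output format. It does not reproduce Microsoft's actual prompt and calls no LLM; the model output is a hard-coded string so the validation and merge steps can be run as-is.

```python
ENTITY_TYPES = ["Well", "Operator", "Formation", "Basin"]  # hypothetical ontology


def build_extraction_prompt(chunk, entity_types):
    """Embed the allowed entity types in the prompt sent with each chunk."""
    type_list = ", ".join(entity_types)
    return (
        "Extract entities from the text below.\n"
        f"Use only these entity types: {type_list}.\n"
        "Return one entity per line as: name | type\n\n"
        f"Text:\n{chunk}"
    )


def parse_entities(raw_output, entity_types):
    """Split model output into entities with allowed types and rejected ones."""
    allowed = set(entity_types)
    accepted, rejected = [], []
    for line in raw_output.splitlines():
        if "|" not in line:
            continue  # skip lines that do not follow the requested format
        name, type_name = (part.strip() for part in line.split("|", 1))
        target = accepted if type_name in allowed else rejected
        target.append((name, type_name))
    return accepted, rejected


def merge_entities(per_chunk_results):
    """Merge per-chunk entities into one corpus-level table with mention counts."""
    merged = {}
    for entities in per_chunk_results:
        for key in entities:
            merged[key] = merged.get(key, 0) + 1
    return merged


prompt = build_extraction_prompt("Acme operates Well 42.", ENTITY_TYPES)
fake_output = "Acme | Operator\nWell 42 | Well\nTexas | Location"  # stand-in for an LLM reply
accepted, rejected = parse_entities(fake_output, ENTITY_TYPES)
merged = merge_entities([accepted, [("Acme", "Operator")]])
print(accepted, rejected, merged)
```

Validating after extraction matters because a prompt constraint is a request, not a guarantee: the type list in the prompt steers the model, and the post-check is what actually enforces the schema.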

Edge et al. (2024), From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv:2404.16130

↩ Back to article

7. HippoRAG as the alternative retrieval primitive

HippoRAG shares the upstream extraction pattern with Microsoft GraphRAG (an LLM extracts triples from each passage, in HippoRAG's case through schemaless open information extraction) but replaces the downstream community-summarization step with Personalized PageRank over the graph. The paper reports gains of up to 20 percent on multi-hop question answering, and single-step retrieval that is 10 to 30 times cheaper and 6 to 13 times faster than iterative retrieval methods such as IRCoT.

The relevance to the article is structural: HippoRAG demonstrates that an ontology-grounded extraction can serve multiple retrieval primitives. The same typed graph supports community-summary aggregation (Microsoft), Personalized PageRank ranking (HippoRAG), and dual-level retrieval (LightRAG). The ontology is the shared substrate; the retrieval primitive is a separate design choice.
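
To make the Personalized PageRank primitive concrete, here is a minimal power-iteration sketch over a small undirected entity graph. It is not HippoRAG's implementation: the function name, the sample edges, and the parameter defaults are assumptions, and the sketch requires every seed to appear in at least one edge.

```python
def personalized_pagerank(edges, seeds, damping=0.85, iterations=50):
    """Rank nodes by random-walk proximity to the seed nodes.

    With probability `damping` the walk follows a random edge; otherwise it
    restarts at one of the seeds (chosen uniformly).
    """
    neighbors = {}
    for a, b in edges:  # treat the graph as undirected
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    nodes = sorted(neighbors)

    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(restart)
    for _ in range(iterations):
        updated = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            share = damping * scores[n] / len(neighbors[n])
            for m in neighbors[n]:
                updated[m] += share
        scores = updated
    return scores


# Hypothetical graph; the query mentions "Acme", so that node is the seed.
edges = [
    ("Acme", "Well_42"),
    ("Well_42", "Formation_X"),
    ("Formation_X", "Basin_P"),
    ("Borealis", "Well_9"),
    ("Well_9", "Formation_X"),
]
scores = personalized_pagerank(edges, seeds={"Acme"})
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {score:.3f}")
```

Nodes close to the seed, such as `Well_42`, score higher than nodes reachable only through several hops, which is the property a retrieval layer exploits when it ranks passages by the entities they mention.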

Gutierrez et al. (2024), HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models. arXiv:2405.14831

↩ Back to article