Ontology-Driven Parsing for Retrieval

A knowledge graph without an ontology is just triples. The ontology is the schema that tells a parser which entities to extract, which relations to record, and what types to assert. Without that commitment, graph-based retrieval cannot be queried with confidence. Production GraphRAG succeeds when the ontology is solid and fails, predictably, when the ontology was an afterthought.

What an Ontology Actually Is

The word "ontology" carries weight from philosophy, where it names the study of what exists. In computer science the meaning narrows to something specific and machine-readable: a formal description of the entities, relations, and constraints in a domain [1]. The formal expression is what distinguishes an ontology from a taxonomy or a glossary; the machine-readability is what distinguishes it from a textbook.

A useful working definition treats an ontology as a tuple of four components:

Description logic gives this structure a useful vocabulary. The TBox (terminological knowledge) is the schema: the classes, properties, and axioms that hold by definition. The ABox (assertional knowledge) is the data: the instances and the specific triples that link them. Teams typically call the TBox "the ontology" and the ABox "the knowledge graph," but the two are inseparable. The ontology is the schema; the knowledge graph is what conforms to it.
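The TBox/ABox split can be made concrete with a small sketch. The class names `TBox` and `ABox` and the helper `conforms` are illustrative assumptions, not part of any standard library; the point is only that the schema and the data are separate objects, and that "conforms to" is a check one can actually run.

```python
from dataclasses import dataclass, field

@dataclass
class TBox:
    """Schema: the classes, plus the (domain, range) of each property."""
    classes: set[str]
    properties: dict[str, tuple[str, str]]

@dataclass
class ABox:
    """Data: typed instances and the triples that link them."""
    instance_of: dict[str, str] = field(default_factory=dict)
    triples: list[tuple[str, str, str]] = field(default_factory=list)

def conforms(tbox: TBox, abox: ABox) -> bool:
    """True if every instance type and every triple is licensed by the TBox."""
    if any(cls not in tbox.classes for cls in abox.instance_of.values()):
        return False
    for subject, prop, obj in abox.triples:
        if prop not in tbox.properties:
            return False
        domain, range_ = tbox.properties[prop]
        if abox.instance_of.get(subject) != domain:
            return False
        if abox.instance_of.get(obj) != range_:
            return False
    return True

tbox = TBox(
    classes={"Well", "Operator"},
    properties={"operatedBy": ("Well", "Operator")},
)
abox = ABox(
    instance_of={"Smith #4H": "Well", "Apache": "Operator"},
    triples=[("Smith #4H", "operatedBy", "Apache")],
)
print(conforms(tbox, abox))  # True
```

This sketch compares types by exact match; a fuller version would also accept subclasses of the declared domain and range.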

. . .

An Ontology Doing the Work

A working demonstration helps anchor the abstractions before the next section drops back into prose. Mutato is an ontology-driven entity extractor that runs entirely in the browser, with no language model and no learned weights. It carries a curated Middle-earth ontology covering people, places, factions, weapons, and the rest of the Tolkien inventory, and uses three deterministic matching passes to annotate any sentence the user types.

[Interactive demo: type a Tolkien-flavored sentence and watch the ontology resolve it.]

The three matching passes correspond directly to the components an ontology defines. The exact pass looks for canonical surface forms: if the ontology declares Gandalf as the canonical label for an entity, the pass tags any occurrence of "Gandalf" in the input. The span pass walks the controlled vocabulary, recognizing compositional names: "Aragorn son of Arathorn" resolves to the Aragorn entity even though that string is not the canonical label. The hierarchy pass walks the class structure, so "the Ranger" resolves to Aragorn because the ontology declares Aragorn as an instance of Ranger.

Each extracted entity returns with its canonical identifier, the surface form that produced the match, and a badge naming which pass found it. The output is provenance-rich and reproducible. Run the same input twice and the system returns the same triples; nothing is sampled, nothing drifts. The ontology editor in the same interface lets the operator change the schema, reorder classes, add aliases, and watch the extractions update against the new schema immediately.
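The three passes can be approximated in a few lines. What follows is a minimal sketch and not Mutato's implementation: the tiny ontology, the identifiers such as `ent:aragorn`, and the rule that longer surface forms win over shorter overlapping ones are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical miniature ontology (not Mutato's actual data).
CANONICAL = {"Gandalf": "ent:gandalf", "Aragorn": "ent:aragorn"}
ALIASES = {"Aragorn son of Arathorn": "ent:aragorn"}
CLASS_INSTANCES = {"Ranger": ["ent:aragorn"]}

@dataclass(frozen=True)
class Match:
    entity_id: str   # canonical identifier
    surface: str     # the text that produced the match
    found_by: str    # which pass produced it
    start: int       # character offset in the input

def candidates():
    """Yield (surface form, entity id, pass name) for all three passes."""
    for label, entity_id in CANONICAL.items():
        yield label, entity_id, "exact"
    for alias, entity_id in ALIASES.items():
        yield alias, entity_id, "span"
    for cls, members in CLASS_INSTANCES.items():
        if len(members) == 1:  # only resolve unambiguous class mentions
            yield f"the {cls}", members[0], "hierarchy"

def annotate(text: str) -> list[Match]:
    claimed: list[tuple[int, int]] = []
    found: list[Match] = []
    # Longest surface forms first, so a compositional name is not
    # pre-empted by a shorter label embedded inside it.
    for surface, entity_id, found_by in sorted(candidates(), key=lambda c: -len(c[0])):
        for m in re.finditer(r"\b" + re.escape(surface) + r"\b", text):
            if any(m.start() < end and start < m.end() for start, end in claimed):
                continue
            claimed.append((m.start(), m.end()))
            found.append(Match(entity_id, surface, found_by, m.start()))
    return sorted(found, key=lambda match: match.start)

for match in annotate("Gandalf met Aragorn son of Arathorn, and the Ranger smiled."):
    print(match.found_by, match.surface, "->", match.entity_id)
```

Because every step is a dictionary lookup or a regular-expression scan, running the function twice on the same input returns identical matches, which is the reproducibility property the paragraph above describes.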

The point this demonstration anchors: ontology-driven parsing is not a degraded form of LLM extraction. It is a different operation with different guarantees. An LLM extractor produces a plausible graph at the cost of unpredictable extractions across runs. An ontology-driven extractor produces a smaller, schema-conformant graph at the cost of upfront curation, with the guarantee that every triple was produced by a rule the operator wrote down. The rest of the article argues that the upfront curation pays for itself in queryability and auditability. The demo above is the part of that argument that does not need any prose.

. . .

Why Named-Entity Recognition Is Not Enough

Raw NER tags spans of text as entities of generic types: PERSON, ORG, LOCATION, DATE. Modern NER models built on transformers reach 90%+ F1 on standard benchmarks for these generic types, and they are useful for many things. They are not enough for graph-based retrieval.

The reason is that generic NER does not commit to a domain ontology. It tags "Apple" as ORG without linking the mention to a specific entity such as Apple Inc., and a mention of the fruit or of a person named Apple can be mis-tagged the same way. It typically tags "Wolfcamp" as MISC, or misses it, because general-purpose entity dictionaries do not contain it. The resulting tags are useful for downstream summarization but not for the typed graph traversal that GraphRAG requires.

Consider the source sentence:

Apache drilled the Smith #4H in the Permian Basin's Wolfcamp formation in March 2024, completing a 9,800-foot lateral with 56 stages.

| Span | Raw NER | Typed extraction |
| --- | --- | --- |
| Apache | ORG | Operator |
| Smith #4H | MISC (often missed or tagged MISC) | Well (subject of the operating relation) |
| Permian Basin | LOC | Basin |
| Wolfcamp | MISC (type collision with ORG) | Formation (locatedIn PermianBasin) |
| March 2024 | DATE | 2024-03-01, spudDate (Well) |
| 9,800 ft / 56 stages | none (measurements not extracted) | Completion (lateralLength, stageCount) |
| Relations | none (tags alone, no graph) | Typed subgraph: Smith #4H operatedBy Apache; Smith #4H drillsThrough Wolfcamp |

Same sentence, two extraction regimes.

The typed output is a small subgraph that merges into the larger knowledge graph. The merging happens by identity: Apache resolves to the same Operator instance across thousands of documents, Wolfcamp resolves to the same Formation instance, and so on. The accumulated graph supports the kinds of queries that no individual document could answer.
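Merging by identity can be sketched as a set union keyed on canonical identifiers. The identifiers (`well:smith-4h`, `op:apache`) and the second well, `well:jones-2h`, are hypothetical and exist only to show two documents contributing edges to the same node.

```python
from collections import defaultdict

def merge(*subgraphs):
    """Union per-document triple sets; equal identifiers collapse into one node."""
    graph = defaultdict(set)
    for triples in subgraphs:
        for subject, predicate, obj in triples:
            graph[subject].add((predicate, obj))
    return graph

# Subgraph extracted from the example sentence.
doc_a = {
    ("well:smith-4h", "operatedBy", "op:apache"),
    ("well:smith-4h", "drillsThrough", "fm:wolfcamp"),
}
# Subgraph from a hypothetical second document.
doc_b = {
    ("well:jones-2h", "operatedBy", "op:apache"),
    ("fm:wolfcamp", "locatedIn", "basin:permian"),
}

graph = merge(doc_a, doc_b)
apache_wells = sorted(
    subject for subject, edges in graph.items()
    if ("operatedBy", "op:apache") in edges
)
print(apache_wells)  # ['well:jones-2h', 'well:smith-4h']
```

Neither document alone lists both wells; the answer exists only in the merged graph.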

. . .

The Components of a Useful Ontology

A working ontology has four kinds of content, each requiring different curation effort and different governance.

| Component | Examples | Role | Curation effort |
| --- | --- | --- | --- |
| 1. Class Hierarchy | Well, Operator, Formation, Basin, Equipment... | subclassing enables generalization | high |
| 2. Property Schema | operatedBy, drillsThrough, locatedIn, producesFrom... | typed domain and range constraints | highest |
| 3. Controlled Vocabulary | ExxonMobil = Exxon Mobil = XOM = Exxon | canonical form prevents graph fragmentation | continuous |
| 4. Axioms | cardinality, disjointness, inverses | inference and consistency | moderate |

TBox: schema, slow-changing. ABox: instances, fast-changing.

Four components of an ontology.

1. The Class Hierarchy

Top-level classes anchor the ontology. In a general-purpose ontology, these are categories like Person, Organization, Place, Event. In a domain ontology, they go deeper and become specific: in oil and gas, classes like Well, Wellbore, Field, Formation, Operator, ProductionPeriod. Subclasses refine the hierarchy: Wellbore is a subclass of Asset; Operator is a subclass of Organization; Wolfcamp is an instance of Formation, which is a subclass of GeologicalUnit.

Subclassing is what enables generalization. If a query asks for "all Assets in the Permian," the graph traversal can include Wells, Wellbores, Equipment, and any other subclass of Asset that participates in the locatedIn relationship. Without a hierarchy, every query has to enumerate every relevant subtype.
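A minimal sketch of that generalization, assuming a single-parent hierarchy stored as a child-to-parent map. The `HorizontalWell` class and the `Rig 12` instance are hypothetical additions used to show that the traversal reaches more than one level down.

```python
# Hypothetical slice of the hierarchy: child class -> parent class.
SUBCLASS_OF = {
    "Well": "Asset",
    "Wellbore": "Asset",
    "Equipment": "Asset",
    "HorizontalWell": "Well",   # illustrative second level
    "Operator": "Organization",
}

INSTANCE_OF = {
    "Smith #4H": "HorizontalWell",
    "Rig 12": "Equipment",      # hypothetical instance
    "Apache": "Operator",
}
LOCATED_IN = {"Smith #4H": "Permian Basin", "Rig 12": "Permian Basin"}

def subclasses_of(cls: str) -> set[str]:
    """Return `cls` together with every class below it in the hierarchy."""
    found = {cls}
    frontier = [cls]
    while frontier:
        parent = frontier.pop()
        for child, superclass in SUBCLASS_OF.items():
            if superclass == parent and child not in found:
                found.add(child)
                frontier.append(child)
    return found

def instances_in(cls: str, place: str) -> list[str]:
    """Instances of `cls` or any subclass that are located in `place`."""
    wanted = subclasses_of(cls)
    return sorted(
        entity for entity, entity_cls in INSTANCE_OF.items()
        if entity_cls in wanted and LOCATED_IN.get(entity) == place
    )

print(instances_in("Asset", "Permian Basin"))  # ['Rig 12', 'Smith #4H']
```

The query names only `Asset`; the hierarchy supplies the subtypes, so adding a new subclass later requires no change to the query.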

2. The Property Schema

Each property has a domain (the class of the subject) and a range (the class of the object). operatedBy has domain Well and range Operator. drillsThrough has domain Wellbore and range Formation. Properties can carry cardinality constraints (a Well has exactly one current Operator) and inverse relationships (operatedBy and operates are inverses).

The property schema is where ontologies do their hardest work. Generic NER does not commit to which entity is the subject and which is the object of a relation; typed extraction must. The schema is what tells the parser to read "Apache drilled Smith #4H" as Smith #4H operatedBy Apache, not the inverse.
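One way to picture how domain and range settle the direction of a relation: given two typed mentions and a candidate property, only one orientation satisfies the schema. The function name `orient` and the lookup tables are assumptions for this sketch.

```python
# Property -> (domain class, range class), as declared in the schema.
PROPERTY_SCHEMA = {"operatedBy": ("Well", "Operator")}

# Types assigned to the mentions by an earlier entity-typing step (assumed).
ENTITY_TYPES = {
    "Apache": "Operator",
    "Devon Energy": "Operator",
    "Smith #4H": "Well",
}

def orient(prop: str, first: str, second: str) -> tuple[str, str, str]:
    """Return the triple in the only direction the schema permits."""
    domain, range_ = PROPERTY_SCHEMA[prop]
    first_type, second_type = ENTITY_TYPES[first], ENTITY_TYPES[second]
    if (first_type, second_type) == (domain, range_):
        return (first, prop, second)
    if (second_type, first_type) == (domain, range_):
        return (second, prop, first)
    raise ValueError(
        f"{prop} needs {domain} -> {range_}, got {first_type} and {second_type}"
    )

# "Apache drilled Smith #4H": the text order is operator first, well second.
print(orient("operatedBy", "Apache", "Smith #4H"))
# ('Smith #4H', 'operatedBy', 'Apache')
```

A pair that fits neither direction, such as two operators, raises an error rather than producing a triple the schema does not license.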

3. The Controlled Vocabulary

The same entity appears under different strings across documents. ExxonMobil, Exxon Mobil Corporation, XOM, and Exxon all refer to the same Operator instance. The ontology either commits to a canonical form (with synonym mappings) or maintains an explicit equivalence relation between the variants. Without this discipline, the graph fragments: queries against "ExxonMobil" miss documents that named the operator "Exxon Mobil Corporation."
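The canonical-form approach reduces to a lookup table from normalized surface strings to one identifier. This is a sketch under simple assumptions: the identifier `op:exxonmobil` is invented, and normalization here is limited to case folding and whitespace collapsing.

```python
# Canonical identifier -> known surface variants.
SYNONYMS = {
    "op:exxonmobil": ["ExxonMobil", "Exxon Mobil Corporation", "XOM", "Exxon"],
}

def _key(surface: str) -> str:
    """Normalize case and collapse runs of whitespace."""
    return " ".join(surface.split()).casefold()

LOOKUP = {
    _key(variant): canonical
    for canonical, variants in SYNONYMS.items()
    for variant in variants
}

def canonical_id(surface: str):
    """Return the canonical identifier, or None for an unknown string."""
    return LOOKUP.get(_key(surface))

mentions = ["ExxonMobil", "exxon mobil  corporation", "XOM", "Chevron"]
print([canonical_id(m) for m in mentions])
# ['op:exxonmobil', 'op:exxonmobil', 'op:exxonmobil', None]
```

Returning `None` for unknown strings, instead of minting a new node, keeps unmapped variants visible as a curation queue rather than letting them fragment the graph silently.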

4. The Axioms

Axioms are constraints the ontology asserts to hold. A few examples from a working oil-and-gas ontology:

Axioms enable two things. The first is inference: if the graph contains "Smith #4H operatedBy Apache" and "Apache mergedInto APA Corporation," the axioms can derive that Smith #4H is now operated by APA Corporation. The second is consistency checking: if extraction produces "Smith #4H operatedBy Apache" and "Smith #4H operatedBy Devon Energy" for the same time period, the cardinality axiom flags the conflict for review.
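Both uses of axioms can be sketched without a reasoner. The two functions below are hand-written stand-ins for what an axiom-aware engine would derive; their names, the dictionary-based graph, and the period labels such as `2024-Q1` are assumptions.

```python
from collections import defaultdict

def current_operator(well, operated_by, merged_into):
    """Inference: follow mergedInto links from the recorded operator."""
    operator = operated_by[well]
    seen = {operator}
    while operator in merged_into:
        operator = merged_into[operator]
        if operator in seen:
            raise ValueError(f"merger cycle at {operator}")
        seen.add(operator)
    return operator

def cardinality_conflicts(assertions):
    """Consistency: find (well, period) pairs with more than one operator.

    `assertions` is an iterable of (well, operator, period) tuples.
    """
    operators = defaultdict(set)
    for well, operator, period in assertions:
        operators[(well, period)].add(operator)
    return {key: ops for key, ops in operators.items() if len(ops) > 1}

print(current_operator(
    "Smith #4H",
    operated_by={"Smith #4H": "Apache"},
    merged_into={"Apache": "APA Corporation"},
))  # APA Corporation

conflicts = cardinality_conflicts([
    ("Smith #4H", "Apache", "2024-Q1"),
    ("Smith #4H", "Devon Energy", "2024-Q1"),
    ("Smith #4H", "Apache", "2023-Q4"),
])
for (well, period), ops in conflicts.items():
    print(well, period, sorted(ops))
# Smith #4H 2024-Q1 ['Apache', 'Devon Energy']
```

The conflict is reported for review rather than resolved automatically, which matches the role the cardinality axiom plays in the paragraph above.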

. . .

The Standards Landscape

The W3C semantic web standards form a layered stack, from least expressive to most. RDF [2] defines the triple-based data model, OWL 2 [3] defines the expressive constraints that make ontologies queryable with confidence, and SKOS [4] handles the controlled-vocabulary side of the same problem. Most production ontologies use a mix of layers depending on what is actually needed.

| Standard | What it provides | When it fits |
| --- | --- | --- |
| RDF | The data model. Everything is a triple of (subject, predicate, object). The atomic representation. | Always. RDF is the substrate every other layer sits on. |
| RDFS | Basic schema vocabulary: Class, subClassOf, Property, domain, range. Enough to declare a class hierarchy. | When the schema is simple and the team prefers a small footprint. |
| OWL | Expressive constraints: cardinality, transitive properties, equivalence, disjointness, inverse properties. Decidable reasoning. | When the ontology needs inference or strict consistency checking. |
| SKOS | Thesaurus and controlled-vocabulary constructs: broader, narrower, related, prefLabel, altLabel. | When the goal is vocabulary management rather than logical reasoning. |
| schema.org | De facto web-scale ontology. Types for Person, Organization, Place, Event, Product, etc. | As a reference point for general-purpose entity types; rarely sufficient alone for domain work. |

In practice, an enterprise ontology typically uses RDF as the data model, OWL or RDFS for the class hierarchy and property schema, SKOS for vocabulary curation, and schema.org as a reference for any types that overlap with the general web. The combination is more common than any single standard used in isolation.

One pragmatic note: most modern GraphRAG implementations do not exchange data in formal RDF or expose SPARQL endpoints. They use JSON, property graphs (Neo4j-style), or in-memory representations. The semantic web standards inform the schema design, but the runtime representation is whatever the implementation finds convenient. This is fine. The discipline is in the schema, not in the serialization format [5].

. . .

What a Parsed Corpus Enables

The query types that become possible against a corpus parsed into a typed knowledge graph are different in kind from what vector RAG supports. Each example below describes a query that vector retrieval cannot answer at all.

Multi-hop traversal. Find all operators running wells in the Wolfcamp formation completed since 2023 with lateral lengths over 10,000 feet. No single document contains all four constraints. Vector RAG returns documents that mention some of the constraints; graph traversal walks from Wolfcamp to Wells, filters by completion date, filters by lateral length, joins to Operators.
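The traversal can be sketched over an in-memory triple set. The well identifiers other than Smith #4H, the property names `completionDate` and `lateralLengthFt`, and all the values are hypothetical; a real deployment would run the same logic in its graph store's query language.

```python
from datetime import date

# Hypothetical triples; each fact could come from a different document.
TRIPLES = {
    ("well:smith-4h", "drillsThrough", "fm:wolfcamp"),
    ("well:smith-4h", "operatedBy", "op:apache"),
    ("well:smith-4h", "completionDate", date(2024, 3, 1)),
    ("well:smith-4h", "lateralLengthFt", 9800),
    ("well:jones-2h", "drillsThrough", "fm:wolfcamp"),
    ("well:jones-2h", "operatedBy", "op:devon"),
    ("well:jones-2h", "completionDate", date(2023, 8, 15)),
    ("well:jones-2h", "lateralLengthFt", 10500),
    ("well:old-1h", "drillsThrough", "fm:wolfcamp"),
    ("well:old-1h", "operatedBy", "op:apache"),
    ("well:old-1h", "completionDate", date(2019, 5, 1)),
    ("well:old-1h", "lateralLengthFt", 11000),
}

def subjects(predicate, obj):
    """All subjects linked to `obj` by `predicate`."""
    return {s for s, p, o in TRIPLES if p == predicate and o == obj}

def value(subject, predicate):
    """The object for (subject, predicate), or None if absent."""
    return next((o for s, p, o in TRIPLES if s == subject and p == predicate), None)

def operators_matching(formation, completed_since, min_lateral_ft):
    """Formation -> wells -> filter by date and length -> operators."""
    result = set()
    for well in subjects("drillsThrough", formation):
        completed = value(well, "completionDate")
        lateral = value(well, "lateralLengthFt")
        operator = value(well, "operatedBy")
        if completed is None or lateral is None or operator is None:
            continue
        if completed >= completed_since and lateral > min_lateral_ft:
            result.add(operator)
    return result

print(operators_matching("fm:wolfcamp", date(2023, 1, 1), 10_000))  # {'op:devon'}
```

Smith #4H fails the length filter and the 2019 well fails the date filter, so only one operator survives all four constraints.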

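Aggregation. What is the median production from Apache-operated Wolfcamp wells in the first half of 2024? Answering requires identifying all qualifying wells, joining production data, and computing a statistic. A graph query language such as SPARQL handles the join and the filter directly; its built-in aggregates cover COUNT, SUM, AVG, MIN, and MAX, so a median is typically computed over the query result or through an engine-specific extension. A vector RAG system can do neither step, because aggregation is not a retrieval operation.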
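A sketch of the aggregation, reading "median production" as the median of per-well totals over the period. That reading, the well identifiers, and the production figures are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median

# Hypothetical typed facts about each well.
WELLS = {
    "well:a": {"operator": "op:apache", "formation": "fm:wolfcamp"},
    "well:b": {"operator": "op:apache", "formation": "fm:wolfcamp"},
    "well:c": {"operator": "op:apache", "formation": "fm:wolfcamp"},
    "well:d": {"operator": "op:devon", "formation": "fm:wolfcamp"},
}

# Hypothetical monthly production records: (well, "YYYY-MM", barrels).
PRODUCTION = [
    ("well:a", "2024-01", 100),
    ("well:a", "2024-02", 200),
    ("well:a", "2024-07", 1000),  # outside the first half of 2024
    ("well:b", "2024-03", 500),
    ("well:c", "2024-06", 400),
    ("well:d", "2024-01", 9999),  # different operator
]

def median_well_total(operator, formation, first_month, last_month):
    """Median of per-well production totals for the qualifying wells."""
    totals = defaultdict(int)
    for well, month, barrels in PRODUCTION:
        facts = WELLS[well]
        if (
            facts["operator"] == operator
            and facts["formation"] == formation
            and first_month <= month <= last_month  # ISO months sort as strings
        ):
            totals[well] += barrels
    return median(totals.values()) if totals else None

print(median_well_total("op:apache", "fm:wolfcamp", "2024-01", "2024-06"))  # 400
```

The filter works only because operator and formation are typed, canonical values; the statistic itself is ordinary arithmetic once the qualifying set is known.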

Constraint checking. Are there any wells listed as operated by Apache in the December 2024 production filing but recorded as operated by a different operator in the well-permit database? Requires a join across two data sources, both parsed against the same ontology, with the conflict surfaced by an axiom about Operator uniqueness.

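Inference. Given the axiom that every Well has exactly one Operator at a time, identify wells in the graph where the extracted triples violate this constraint. The reasoner walks the graph, applies the axiom, and surfaces inconsistencies. One caveat: OWL makes no unique-name assumption, so a reasoner flags two operators on one well as a contradiction only when the operators are declared distinct individuals; otherwise it infers that they are the same entity. With that in place, this is where OWL's expressiveness earns its place over plain RDF.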

The key point: the parsed corpus answers a different class of question than the unparsed corpus. Vector RAG answers "find me documents that look like the question." Typed graph traversal answers "find me entities and relationships that satisfy these constraints." These are not the same operation, and neither replaces the other.

. . .

The Operational Cost of Curation

Curating an ontology is human work, and there is no shortcut.

A domain ontology of moderate scope (200 to 500 classes, 100 to 200 properties) typically requires four to twelve weeks of subject-matter expert time working alongside an ontology engineer for the initial schema design. Another four to eight weeks of work go into the initial controlled vocabulary buildout, including the synonym tables and the canonical-form decisions. Once the ontology stabilizes, ongoing maintenance runs at about 10 to 20 percent of one full-time engineer's time.

The expensive part is not the formal expression in OWL or RDF; tools for that are mature. The expensive part is domain-expert decision-making about questions the documents themselves rarely answer cleanly:

An opinionated ontology engineer has a pragmatic answer to a question teams often try to avoid: do we need the ontology to be philosophically right, or do we need it to be agreed-upon? The pragmatic answer is usually agreed-upon. An ontology that codifies an arbitrary but consistent set of decisions is more useful than one that tries to be perfect, because the consistent ontology supports queries today while the perfect ontology stays in draft.

. . .

When Ontology-Driven Parsing Earns Its Place

Three signals indicate the upfront investment is justified:

The query workload is structured. Users or downstream agents are going to ask multi-hop, aggregating, or constraint-satisfaction questions. If the workload is "summarize what this document says" or "find documents about topic X," vector RAG is cheaper and adequate. If the workload is "which assets satisfy these joint constraints across multiple data sources," vector RAG cannot answer the question at all.

The corpus has stable entity types. Operations, drilling, production, finance, legal, regulatory: domains where the same entity types appear consistently across documents benefit from a shared ontology. The investment amortizes across queries. Domains where every document introduces new entity types (general web crawling, broad news, exploratory research) do not benefit as cleanly; the ontology either stays small and incomplete, or grows unboundedly and becomes unmaintainable.

The cost of wrong answers is high. Regulated industries, financial reporting, medical diagnostics, safety-critical operations: domains where a wrong answer has compliance or safety consequences benefit from the structured, auditable retrieval path that typed graph traversal provides. Vector RAG's "the most-similar document said so" is harder to defend in a regulatory audit than "the graph encodes this fact and here is the provenance chain."

| Signal | Vector RAG | Ontology-Driven |
| --- | --- | --- |
| Signal 1: Query workload (structured, multi-hop, aggregating) | Cannot answer (no join, no aggregation) | Wins (SPARQL handles joins) |
| Signal 2: Entity types (stable across the corpus) | Comparable (entity-blind retrieval) | Wins (investment amortizes) |
| Signal 3: Cost of wrong answer (compliance, safety, audit) | Harder to defend (similarity is not evidence) | Wins (provenance chain) |
| Default: none of the above signals fire | Vector RAG is enough | Over-engineered |

Three signals for ontology fit.

LLM-as-Extractor as the Alternative

A more recent option is to skip ontology curation entirely and have an LLM extract entities and relations on a per-document basis, with the LLM choosing the types ad hoc. This works for prototypes and for domains where ontology curation is genuinely impractical. The cost is that the extracted graph is inconsistent across documents: the same entity might be typed differently, the same relation might be named differently, and the graph cannot be queried with confidence that a missing edge means "no relation" rather than "the parser used a different name for this edge in document X."

The hybrid pattern, which is what many production GraphRAG implementations do, uses an LLM extractor with the curated ontology as a prompt-level constraint. The LLM is told the canonical type vocabulary and the canonical relation vocabulary, and asked to extract triples that conform. Microsoft's GraphRAG [6], HippoRAG [7], and LightRAG all build their graphs with an LLM extractor, though they differ in how tightly the prompt constrains the type and relation vocabulary. The ontology is what makes the LLM's extractions consistent across documents and far more repeatable across runs.
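The hybrid pattern has two halves that can be sketched without calling any model: building the prompt from the ontology, and rejecting returned triples that do not conform. The JSON response shape, the function names, and the simulated response are assumptions; none of this reflects the actual prompts used by the systems named above.

```python
import json

CLASSES = ["Well", "Wellbore", "Operator", "Formation"]
# Property -> (domain, range), following the property schema described earlier.
PROPERTIES = {
    "operatedBy": ("Well", "Operator"),
    "drillsThrough": ("Wellbore", "Formation"),
}

def build_prompt(text: str) -> str:
    """Spell out the allowed vocabulary so the model cannot invent its own."""
    lines = [
        "Extract triples from the text as a JSON list.",
        "Use only these types: " + ", ".join(CLASSES),
        "Use only these relations (domain -> range):",
    ]
    lines += [f"- {prop}: {dom} -> {rng}" for prop, (dom, rng) in PROPERTIES.items()]
    lines += [
        'Item shape: {"subject": [name, type], "predicate": name, "object": [name, type]}',
        "Text: " + text,
    ]
    return "\n".join(lines)

def conforming(raw_response: str):
    """Split the model's triples into schema-conformant and rejected."""
    kept, rejected = [], []
    for item in json.loads(raw_response):
        (subject, s_type), predicate, (obj, o_type) = (
            item["subject"], item["predicate"], item["object"],
        )
        if PROPERTIES.get(predicate) == (s_type, o_type):
            kept.append((subject, predicate, obj))
        else:
            rejected.append(item)
    return kept, rejected

# Simulated model output: one valid triple, one with an ad hoc relation name.
fake_response = json.dumps([
    {"subject": ["Smith #4H", "Well"], "predicate": "operatedBy",
     "object": ["Apache", "Operator"]},
    {"subject": ["Apache", "Operator"], "predicate": "drilled",
     "object": ["Smith #4H", "Well"]},
])
kept, rejected = conforming(fake_response)
print(kept)           # [('Smith #4H', 'operatedBy', 'Apache')]
print(len(rejected))  # 1
```

The validation step matters as much as the prompt: a prompt constraint is a request, while the post-check is what guarantees that only schema-conformant triples reach the graph.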

. . .

A Worked Oil and Gas Ontology

A working ontology for upstream oil-and-gas operations might have a spine that looks like the one below. The five top-level classes anchor color families; subClassOf edges form the taxonomy; object properties cross between classes; datatype properties reach literal values like Date and Length. In the interactive version of this figure, clicking a class isolates its incident relations, and clicking an axiom shows which classes and edges the constraint touches, alongside the OWL/Turtle fragment that makes it machine-checkable.

The same ontology spine as an interactive schema graph: classes, properties, and the axioms that constrain them.

The full ontology would be larger. What appears here is the part that any production deployment needs: enough classes to type the entities a parser will encounter in operations documents, enough properties to wire them into a graph that answers multi-hop and aggregating queries, and enough axioms to catch the contradictions extraction inevitably produces.

From this spine, a parser can extract structured triples from operations documents and merge them into a knowledge graph that supports the multi-hop, aggregating, and constraint-checking queries listed earlier. The graph is the substrate for everything GraphRAG does at retrieval time. The ontology is what makes the substrate trustworthy.

. . .

The Maintenance Problem

Ontologies drift. New entity types appear in the world; existing types subdivide as terminology refines; industry conventions shift. The maintenance burden has three distinct flavors.

Schema evolution. New classes and properties have to be added. Existing classes sometimes have to be split (a generic DrillingEvent gets refined into VerticalDrilling, LateralDrilling, ReDrilling). Properties can be deprecated when their original meaning becomes ambiguous. The instance graph then needs migration to the new schema, which requires careful versioning and provenance tracking.
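A class split can be sketched as a migration that retypes what it can, records where each change came from, and queues the rest for review. The `trajectory` attribute, the record layout, and the version string are hypothetical.

```python
import copy

def split_class(instances, old_class, choose_subclass, schema_version):
    """Retype instances of `old_class`; return (migrated graph, ids needing review)."""
    migrated, needs_review = {}, []
    for instance_id, record in instances.items():
        record = copy.deepcopy(record)  # never mutate the pre-migration graph
        if record["type"] == old_class:
            new_class = choose_subclass(record)
            if new_class is None:
                needs_review.append(instance_id)  # stays on the old class for now
            else:
                record["type"] = new_class
                record["provenance"] = {
                    "migrated_from": old_class,
                    "schema_version": schema_version,
                }
        migrated[instance_id] = record
    return migrated, needs_review

def by_trajectory(record):
    """Hypothetical rule: pick the subclass from a `trajectory` attribute."""
    mapping = {"vertical": "VerticalDrilling", "lateral": "LateralDrilling"}
    return mapping.get(record.get("trajectory"))

events = {
    "evt:1": {"type": "DrillingEvent", "trajectory": "lateral"},
    "evt:2": {"type": "DrillingEvent"},     # not enough information to split
    "evt:3": {"type": "CompletionEvent"},   # untouched by this migration
}
migrated, needs_review = split_class(events, "DrillingEvent", by_trajectory, "2.0")
print(migrated["evt:1"]["type"])  # LateralDrilling
print(needs_review)               # ['evt:2']
```

Keeping the original graph untouched and stamping each retyped record with its source class is one simple way to make the migration reversible and auditable.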

Vocabulary drift. Terminology changes faster than the schema. "Permian Basin" was once specific enough; current usage distinguishes the Permian Delaware sub-basin, the Permian Midland sub-basin, and the Central Basin Platform. The controlled vocabulary needs continuous updates, and the synonym tables need to map old terms to current ones without losing the historical labels that older documents use.

Quality control. Extraction is never perfect. A small but steady fraction of extracted triples will be wrong. Quality-assurance loops with human review, downstream consistency checking against axioms, and provenance-tracked retraction are part of the maintenance burden, not optional polish.
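Provenance-tracked retraction can be sketched as an append-only log in which a wrong triple is marked rather than deleted. The class names, the document identifiers, and the reason string are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Assertion:
    triple: tuple[str, str, str]
    source_doc: str                         # where the triple was extracted from
    retracted_reason: Optional[str] = None  # set on retraction, never deleted

class TripleLog:
    """Append-only record of extracted triples and their retractions."""

    def __init__(self):
        self._log: list[Assertion] = []

    def record(self, triple, source_doc):
        self._log.append(Assertion(triple, source_doc))

    def retract(self, triple, reason) -> int:
        """Mark every live assertion of `triple` as retracted; return the count."""
        count = 0
        for assertion in self._log:
            if assertion.triple == triple and assertion.retracted_reason is None:
                assertion.retracted_reason = reason
                count += 1
        return count

    def active(self):
        """Triples that are currently believed."""
        return {a.triple for a in self._log if a.retracted_reason is None}

    def history(self, triple):
        """Every assertion of `triple`, including retracted ones."""
        return [a for a in self._log if a.triple == triple]

log = TripleLog()
good = ("Smith #4H", "operatedBy", "Apache")
bad = ("Smith #4H", "operatedBy", "Devon Energy")
log.record(good, "doc:permit-001")   # hypothetical document ids
log.record(bad, "doc:news-042")
log.retract(bad, "failed cardinality review")
print(log.active())  # {('Smith #4H', 'operatedBy', 'Apache')}
print(log.history(bad)[0].retracted_reason)  # failed cardinality review
```

Because the retracted assertion stays in the log with its source and reason, a later audit can see both what the graph once claimed and why the claim was withdrawn.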

The honest framing of ontology-driven parsing is that it is not a one-time project. It is a continuous practice of curating the schema, the vocabulary, and the extraction quality. Teams that treat the initial ontology delivery as the finish line end up with a graph that diverges from reality, and a GraphRAG system whose answers degrade quietly over months. Teams that staff ongoing ontology maintenance produce knowledge graphs that stay correct.

. . .

References

Textbook grounding, chapter-level citations, and further reading for each numbered reference in this article live on the companion sources page.

  1. Gruber, T. R. (1993). "A Translation Approach to Portable Ontology Specifications." Knowledge Acquisition. The classic paper that defined an ontology as "an explicit specification of a conceptualization"; the definitional anchor for the field.
  2. W3C. (2014). "RDF 1.1 Concepts and Abstract Syntax." W3C Recommendation. The canonical specification for the RDF data model.
  3. W3C. (2012). "OWL 2 Web Ontology Language Document Overview." W3C Recommendation. The canonical specification for OWL 2; the entry point for cardinality constraints, transitivity, equivalence, and the other expressive constructs that make ontologies queryable.
  4. W3C. (2009). "SKOS Simple Knowledge Organization System Reference." W3C Recommendation. The canonical specification for SKOS, designed for thesauri and controlled vocabularies rather than logical reasoning.
  5. Guha, R. V., Brickley, D., & Macbeth, S. (2016). "Schema.org: Evolution of Structured Data on the Web." Communications of the ACM. The reference history for the schema.org effort and the pragmatic principles that drove its design.
  6. Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Truitt, S., & Larson, J. (2024). "From Local to Global: A Graph RAG Approach to Query-Focused Summarization." Microsoft Research. The Microsoft GraphRAG anchor paper, with the LLM-extracted-graph pattern and the global-versus-local query distinction that motivates much of the typed-extraction discipline.
  7. Gutierrez, B. J., Shu, Y., Gu, Y., Yasunaga, M., & Su, Y. (2024). "HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models." NeurIPS 2024. One of the competing systems that uses LLM-driven knowledge graph construction with personalized PageRank traversal; an instructive comparison to Microsoft's GraphRAG approach.