← All Articles

What Is an Ontology?

An ontology is a description of things that exist and how they relate to each other. The instinct to write such a description down is older than computing by some two and a half thousand years, and the artifact still goes by many names: taxonomy, thesaurus, controlled vocabulary, information model, knowledge graph. Each of those names captures part of what an ontology is. This article walks the whole thing, from Aristotle's Categories to the W3C OWL standard.

An old idea

The instinct to classify what exists is older than computing by roughly two and a half millennia. Aristotle's Categories, written around 350 BCE, is the first surviving attempt in the Western tradition to enumerate the kinds of things the world contains: substances, qualities, quantities, relations, places, times, postures, possessions, actions, and affections. Aristotle's project was to ask what kind of statement could be made about a thing, and to fix the structure of those statements so an argument about the world could be checked. The list of ten was less important than the move of writing down a list at all.

A New Yorker style cartoon of Aristotle being praised for his ontology
dude ... nice Ontology

The work has continued under different banners ever since. Linnaeus produced the binomial taxonomy of living things in the 18th century. Mendeleev arranged the elements into the periodic table in the 19th. The U.S. National Library of Medicine has been editing the Medical Subject Headings since 1960. The Gene Ontology began in 1998. Industry standards like the Financial Industry Business Ontology and Schema.org are the most recent iterations of the same impulse. None of these artifacts are interchangeable, but each one answers the question Aristotle asked: what exists in this domain, and how do the things that exist relate to one another?

Computer-science ontologies are the digital descendants of this lineage. The vocabulary changed, the implementation became machine-readable, and the constraints became formal, but the underlying project did not. Tom Gruber, in 1993, gave the field its canonical definition: an ontology is "a specification of a conceptualization." The phrase is precise and has done little to further the understanding of ontologies in practice. Chris Welty offered something a working engineer can put to use: "a description of things that exist and how they relate to each other." Both descriptions name the same artifact. This article uses the second one because it survives translation into code.

. . .

Two sides of the same coin

An ontology model is:

The purpose of natural language processing is:

Ontology-driven NLP means using a semantic model to understand what exists in unstructured data. NLP-driven ontology modeling means using natural language processing techniques to derive semantic models from unstructured data. The two are not separable in practice. Using ontologies with NLP allows an enterprise to turn data into knowledge.

Ontologies provide semantic context. Identifying entities in unstructured text is a picture only half complete. Ontology models complete the picture by showing how these entities relate to other entities, whether in the document or in the wider world.

. . .

Four building blocks

Almost every ontology, regardless of size or domain, is built from the same four kinds of thing.

Classes are the types of thing the domain contains. In a library catalog ontology, the classes might be Book, Author, Publisher, and Library. In a medical ontology, the classes might be Disease, Symptom, Treatment, and Patient. A class is a name for a category whose members share some structural properties.

Properties are the attributes the members of a class can have. A Book has a title, an ISBN, a page count, and a publication year. An Author has a name and a birth date. Properties are typed: a publication year is an integer, a title is a string, a birth date is a date. The ontology fixes the type, so a downstream tool can validate values.

Relationships are the typed connections between members of classes. A Book is written by one or more Authors. A Book is published by one Publisher. A Library holds many Books. Relationships have a domain (the class they originate from) and a range (the class they point to), and an ontology records both.

Constraints are the rules that must hold. Every Book must have at least one Author. A publication year must be between 1450 and the current year. The written by relationship cannot point a Book at itself. Constraints are what raise an ontology above a vocabulary and turn it into something a machine can use for validation and reasoning.

Those four pieces (classes, properties, relationships, constraints) are the whole basic vocabulary. Everything else in the ontology literature, from RDF to OWL to the domain-specific ontologies used in industry, is some refinement or extension of this same structure.

The four building blocks over a Middle-earth ontology.
. . .

A worked example: a small Middle-earth ontology

Concrete is easier than abstract. Here is a small ontology for Tolkien's Middle-earth, written in Turtle (TTL), the readable serialization of RDF that most ontology authors work in by hand. The same graph can be rendered in RDF/XML, N-Triples, JSON-LD, or Notation3 without losing or adding a single triple; the companion demo shows each rendering side by side, plus a force-directed view of the underlying graph.

@prefix :     <https://craigtrim.com/ontologies/lotr#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Classes form a small taxonomy.
:Being a owl:Class .
:Hobbit a owl:Class ; rdfs:subClassOf :Being .
:Maia   a owl:Class ; rdfs:subClassOf :Being .
:Wizard a owl:Class ; rdfs:subClassOf :Maia .
:Balrog a owl:Class ; rdfs:subClassOf :Maia .
:Ring   a owl:Class .
:Realm  a owl:Class .

# Datatype properties (Hobbit-specific shown).
:age      a owl:DatatypeProperty ; rdfs:domain :Hobbit ; rdfs:range xsd:integer .
:homeTown a owl:DatatypeProperty ; rdfs:domain :Hobbit ; rdfs:range xsd:string .

# Object properties typed by domain and range.
:bears    a owl:ObjectProperty ; rdfs:domain :Being ; rdfs:range :Ring .
:livesIn  a owl:ObjectProperty ; rdfs:domain :Being ; rdfs:range :Realm .
:mentors  a owl:ObjectProperty ; rdfs:domain :Maia  ; rdfs:range :Being .

# Instances carry their datatype values AND their object-property links inline.
:Frodo a :Hobbit ;
    :age 50 ; :homeTown "Hobbiton" ;
    :bears :OneRing ; :livesIn :Shire .

:Gandalf a :Wizard ;
    :color "Grey" ;
    :bears :Narya ;
    :mentors :Frodo .

# A constraint: every Ring was forged in exactly one Realm.
:Ring rdfs:subClassOf [
    a owl:Restriction ;
    owl:onProperty   :forgedIn ;
    owl:cardinality  "1"^^xsd:nonNegativeInteger
] .

This excerpt omits the rest of the instances (Bilbo, Sam, Saruman, Radagast, the Balrogs, the other Rings, the Realms) and a few constraints. The full source is in builder/lotr.ttl next to this article, and a small Python script (builder/compile.py) uses rdflib to parse it, round-trip it through every target serialization, and fail the build if any conversion loses a triple. The 126 triples in the source survive every round trip.

Three things to notice about this ontology.

First, the class taxonomy carries inferences. Wizard rdfs:subClassOf Maia and Maia rdfs:subClassOf Being means any tool that asks "is Gandalf a Being?" gets back yes, even though no triple says so directly. The reasoner walks the subclass chain.

Second, the relationships are typed. bears does not merely connect two things, it connects a Being to a Ring. Any tool that follows this relationship knows what kind of thing it will find on the other end, and can compose further queries accordingly: "give me every Maia who bears a Ring."

Third, the ontology is independent of any particular set of instances. Frodo and Gandalf above are illustrative; the ontology would describe Middle-earth just as well if no instances had been written down yet. That separation is what makes ontologies reusable: the same shape can underpin many different fact bases.

The same ontology in five RDF serializations and one force-directed view.
. . .

From text to typed graph

An ontology becomes useful the moment unstructured text is lifted onto its types. Consider a single English sentence:

William Shakespeare was an English playwright. In 1601, he wrote Hamlet.

Three named entities (William Shakespeare, Hamlet, 1601) and one nationality (English) carry almost all the information in the sentence. An ontology lets us assign each of them a class, and lets us assign the relationships between them their own typed names. After lifting, the sentence is no longer prose. It is a small typed graph that a machine can query.

A sentence about William Shakespeare lifted onto a typed graph of ex:Playwright, ex:Play, ex:Country, and ex:Year nodes connected by writes, citizenOf, and publishedIn arrows
Prose lifted onto a typed graph.

Two kinds of arrow run through the diagram. The dashed arrows lift each prose entity onto a class: William Shakespeare to Playwright, England to Country, 1601 to Year, Hamlet to Play. The solid arrows are the typed relationships the ontology asserts between those classes: a Playwright writes a Play, a Playwright is a citizenOf a Country, a Play is publishedIn a Year.

The diagram above carries several annotations even though an NLP parser may have recognized only two of them: William Shakespeare as a Playwright, Hamlet as a Play. The other annotations come from the ontology, which knows that a play has a publication date, a play is written in a language, and the writer of a play is a citizen of a country. It is in the ontology that we specify how a Play is related to a Date, how a Play is related to a Language, and how a Language is related to a Country to an Author, to a work produced by that Author, and so on.

This is why the ontology matters more than the parser. Even when only two entities are annotated in the text, the ontology shows what an additional pass should look for. If the parser has recognized the author and the title, what has it not recognized? Every play has a publication date. The date is in the text; the parser should be looking for it. Every play is written in a language. The language is in the text; the parser should be looking for that too. The ontology gives the parser its next instruction. It also gives the operator a check on completeness: a record with no date, no language, and no country is a record the parser has not finished reading.

The ontology gives us the relations that exist between annotations. It helps us understand each annotated token in the larger context of a semantic chain. It also helps us understand what information we are missing, and what else we need to look for. A growing ontology suggests new typed extractions a pipeline should attempt next. The ontology guides the extraction; the extraction populates the ontology. Each one makes the other more useful over time.

. . .

The standards

In practice, ontologies are written in one of a handful of W3C standards. The standards have grown by accretion since the late 1990s, each layer adding more expressive power.

RDF: the triple

The Resource Description Framework (RDF) is the foundation. RDF represents every claim about the world as a triple: subject, predicate, object. The Middle-earth ontology's "Frodo bears the One Ring" becomes one triple. "Frodo lives in the Shire" is another. The entire ontology, and every record that conforms to it, is a graph of triples.

:Frodo    :bears    :OneRing .
:Frodo    :livesIn  :Shire .
:OneRing  :forgedIn :Mordor .

The colons mark identifiers. The dots mark the end of each triple. The graph that emerges from a collection of triples is the data; the ontology says what kinds of triples are well-formed.

Turtle, RDF/XML, JSON-LD: the serializations

RDF is an abstract data model. The triples have to be written down in some concrete syntax before a parser can read them, and the W3C has standardized several. Turtle (file extension .ttl) is the readable form most ontology authors work in: prefixes at the top, semicolons to share a subject across statements, square brackets for blank nodes. The two examples above and the Middle-earth ontology earlier in this article are all Turtle. RDF/XML is the original 1999 serialization, verbose but still common in archival data. N-Triples is the line-oriented form, one triple per line, useful for streaming and diffing. JSON-LD is the modern web form, designed so JSON consumers can read RDF without learning RDF. All four serialize the same underlying graph and any RDF library will convert losslessly between them. The companion syntax viewer renders the Middle-earth ontology in each format from a single source TTL.

RDFS: classes and properties

RDF on its own is just triples. RDF Schema (RDFS) adds the vocabulary for classes and properties. RDFS lets you say "Book is a class," "writtenBy is a property whose domain is Book and whose range is Author," and "every member of class Novel is also a member of class Book." The class hierarchy, the type system, and the domain-range typing on properties all live in RDFS.

OWL: constraints and reasoning

The Web Ontology Language (OWL, with the letters reordered for euphony) is the layer where real constraints live. OWL lets you say "a Book must have at least one Author," "the writtenBy relationship is the inverse of the wrote relationship," and "no individual can be both a Book and a Library." OWL also enables automated reasoning: if the ontology says Novel is a subclass of Book and War and Peace is a Novel, a reasoner can derive that War and Peace is a Book without that triple being asserted directly.

SKOS: taxonomies and controlled vocabularies

SKOS (Simple Knowledge Organization System) is a lighter standard, built on top of RDF, for representing controlled vocabularies and thesauri rather than full ontologies. SKOS gives you broader-than, narrower-than, and related-to relationships, plus preferred and alternative labels. Library catalog subject headings, museum classification systems, and most multilingual taxonomies are written in SKOS rather than full OWL.

Schema.org: the practical standard

Schema.org is the most widely deployed ontology on Earth, by a large margin. Maintained jointly by Google, Microsoft, Yahoo, and Yandex, it provides a shared vocabulary for marking up structured data on web pages. When a recipe website tags a page as a Recipe with properties for cookTime, ingredients, and nutrition, the search engines know what they are looking at. Schema.org is technically expressed in RDFS plus a few OWL constructs, but its real power is the social fact that almost every major platform agreed on the same vocabulary.

. . .

Ontologies in the wild

Once you know what to look for, ontologies are everywhere. A short tour of working ones.

Ontology Domain Scale (approx.)
Schema.org The web ~1,000 classes, billions of pages
Wikidata General knowledge ~110M items, ~12k properties
DBpedia Wikipedia, as RDF ~7M entities, ~1B triples
MeSH Medicine (NLM) ~30k descriptors, ~640k entry terms
Gene Ontology Biology ~44k terms, ~8M annotations
FIBO Finance ~11k classes, drafted by the EDM Council
FOAF People and social ties ~70 terms, widely embedded

These are not academic exercises. MeSH is the indexing vocabulary for PubMed, the medical literature search engine used by every working clinician. The Gene Ontology underpins functional annotation in genomics. FIBO is the EDM Council's vocabulary for regulatory reporting in finance. Schema.org markup is the input to every modern search engine's structured-data layer. Behind each is the same machinery: classes, properties, relationships, constraints.

. . .

What an ontology gives you that text does not

The reason to write an ontology at all is that some questions cannot be answered from text. Consider three.

Is this record consistent? A library catalog with a book published in 1962 but tagged with an ISBN is inconsistent (the ISBN system did not exist until 1970). The ontology, with its constraint on ISBN availability, can flag this. A text search cannot, because nothing about the record looks textually wrong.

What is implied by what I know? If the ontology says every Novel is a Book and every Book has at least one Author, then a record asserting "War and Peace is a Novel" implies that War and Peace has an Author, even if no writtenBy triple appears explicitly. A reasoner can derive that. A pile of text cannot.

What can I rule out? If the ontology says the writtenBy relationship is irreflexive (a book cannot be its own author), then any record asserting otherwise is wrong. The constraint is not a heuristic; it is a guarantee, machine-checkable, with no false positives.

These three capabilities (consistency checking, inference, ruling out) are what an ontology provides that no amount of unstructured text or similarity-based search ever will. A vector embedding can tell you "Tolstoy" and "Dostoyevsky" cluster together. It cannot tell you that the claim "Tolstoy wrote The Brothers Karamazov" is wrong.

. . .

Cost and payback

Authoring an ontology is expensive. Someone has to sit down with a domain expert, agree on what classes exist, agree on which properties matter, write out the relationships, write out the constraints, debate edge cases, refactor as the domain reveals corners the first draft missed. The Gene Ontology took years to mature. Schema.org has been actively edited since 2011. FIBO has been in development since 2008. There is no shortcut.

The payback is precision, and the payback profile is highly uneven across domains. Three rough patterns.

Where the payback is large: domains where ambiguity has expensive consequences. Medicine, finance, law, scientific publishing, aviation, pharmaceutical regulation. In these domains, a wrong drug dose, a misinterpreted financial instrument, or a miscategorized legal precedent has real costs. The cost of authoring the ontology amortizes quickly against the cost of preventable errors.

Where the payback is moderate: domains with much repeated reasoning over the same structure. E-commerce product catalogs, library and museum collections, government open data, organizational knowledge bases inside large enterprises. The ontology pays back because thousands of queries hit the same structure thousands of times.

Where the payback is small or negative: domains where the structure shifts faster than the ontology can be maintained. Social media topics, transient consumer interests, anything whose vocabulary turns over every six months. Writing an ontology for a domain that will be unrecognizable in two years is a poor use of the authoring cost.

A useful rule of thumb: if the domain has a regulator, a standards body, or a community that has already done part of the work, reuse what exists and extend it. If the domain has none of those, the ontology will be expensive to author and likely incomplete; consider whether the domain actually needs an ontology, or whether a simpler vocabulary or schema would do.

. . .

Ontology or relational database?

Why would we use an ontology over a relational database? The two are not interchangeable. They make opposite assertions about the domain.

The use of a relational database is to make the following assertion: I understand the data that exists in my domain completely, and the data is relatively static. Changes may happen, but the design of the entity-relationship diagram must remain relatively static for applications to build effectively on top of it. The relational model has relationships between entities established through explicit keys, and for many-to-many relationships, associative tables. Changing those relationships is cumbersome, because it requires changes to the schema itself, which is difficult once a database is populated.

The use of an ontology model is to make the opposite assertion: I do not fully understand the data that exists in my domain, I know I will never understand it completely, and far from being static, the data changes constantly.

The graph model that semantic systems use makes it much easier to both query and maintain the model once deployed. If a new relationship needs to be represented that was not anticipated during design, a new triple is simply added to the store. The relationships are part of the data, not part of the database structure. That difference is the entire reason ontologies became the substrate for domains where the schema cannot be frozen.

. . .

How an ontology grows from a taxonomy

An ontology is a taxonomy with semantic relationships layered on top. The fastest way to author one is to start with the taxonomy and let the rest accrete.

Start with the hierarchy. Pick the categories your domain cares about and write the parent-child links. Mammal rdfs:subClassOf Animal. Author, Country, Language, Book, Date. At this stage the ontology is a taxonomy and nothing more, but it is already useful: the hierarchy alone gives you inheritance and a shared vocabulary.

Add types. Once the categories are in place, anchor individuals to them with rdf:type: William Shakespeare rdf:type Author. The test for whether a triple should use rdf:type is the English "is-a" check: pose the claim out loud. "William Shakespeare is-a Author" reads cleanly, so the link is a type assertion. "Steering Wheel is-a Car" does not read cleanly; the relationship between a steering wheel and a car is part-whole, not type. That is a partonomy, which is its own semantic relationship and needs its own predicate, not rdf:type.

Then add semantic relationships. Once the typed taxonomy is in place, the typed predicates that connect concepts to one another (writes, publishedIn, citizenOf) become useful. There are several reasons to add them, none mutually exclusive:

A thesaurus is a genuinely different artifact, and worth distinguishing here so the two do not get conflated. A thesaurus relates words to other words (synonyms, broader and narrower terms). An ontology relates concepts to other concepts, where concepts have types and properties. The word "bank" appears in a thesaurus under both "financial institution" and "river edge"; an ontology distinguishes the two as members of different classes. SKOS exists for the case where you want a controlled vocabulary or a thesaurus without the full machinery of OWL.

. . .

Where to start

If the goal is to learn the shape of an ontology by writing one, start small. Pick a domain you know well that has a clear small set of classes (a personal library, a recipe collection, a sports league, a music catalog). Aim for ten to twenty classes and properties. Use a freely available editor such as Protégé to write the ontology in OWL. Validate it against a reasoner. The first ontology will be wrong in interesting ways, and the second pass will be cleaner than the first because the act of writing surfaces all the assumptions the domain expert was making implicitly.

If the goal is to use an ontology rather than author one, look first at what already exists. The domain you care about almost certainly has a working community ontology somewhere. Reuse is cheaper than reinvention, and the ontology you reuse will have been hardened by people who hit the corners you have not yet hit.

. . .

References

  1. Gruber, T. R. (1993). A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition, 5(2), 199-220. The canonical definition: "an ontology is an explicit specification of a conceptualization."
  2. Guarino, N. and Welty, C. (2002). Evaluating Ontological Decisions with OntoClean. Communications of the ACM, 45(2), 61-65. The OntoClean methodology and the stricter logical-theory framing of what counts as an ontology.
  3. Studer, R., Benjamins, V. R., and Fensel, D. (1998). Knowledge Engineering: Principles and Methods. Data & Knowledge Engineering, 25(1-2), 161-197.
  4. Noy, N. F. and McGuinness, D. L. (2001). Ontology Development 101: A Guide to Creating Your First Ontology. Stanford Knowledge Systems Laboratory Technical Report KSL-01-05.
  5. W3C. Resource Description Framework (RDF).
  6. W3C. RDF Schema 1.1.
  7. W3C. OWL 2 Web Ontology Language.
  8. W3C. SKOS: Simple Knowledge Organization System.
  9. Schema.org. Joint vocabulary for structured data on the web.
  10. Wikidata. The Wikimedia Foundation's structured knowledge base.
  11. Gene Ontology. The standard ontology for gene function annotation.
  12. U.S. National Library of Medicine. Medical Subject Headings (MeSH).
  13. EDM Council. Financial Industry Business Ontology (FIBO).
Ontology Knowledge Representation Semantic Web RDF OWL Memory and Knowledge