The Representational Limits of Ontologies
This post aims to provide a high-level overview of some of the ways in which falsehoods can be created and obscured within ontologies, and how sometimes things that can be evidenced cannot be represented within them. My aim is to demonstrate that ontologies are a lossy format, and that there are specific things, sayable in ordinary language, that we lose when we use them.
I want to start by defining some terms, including what ‘an ontology’ is. In this context, an ontology is a kind of data store that aims to represent arbitrary information in a graph format. People who write about ontologies will sometimes refer to the resulting ontologies as ‘linked open data’ (this implies a commitment to re-using commonly agreed identifiers within the document) and sometimes as ‘RDF triples’ (RDF stands for ‘resource description framework’), which refers to an underlying set of restrictions on how the data is formatted. A triple is an ordered group of three things, and in RDF the position of a thing in the triple marks its role as either the subject, the predicate, or the object of a given relationship. The subject and object are often referred to as ‘entities’ and the predicate is often referred to as ‘a relationship.’ An example of a possible triple could be ‘houses ARE A TYPE OF building’ or ‘boats FLOAT ON water.’ So there is quite a lot of flexibility in what can be expressed as part of an ontology. In fact, they are so-called because they are inspired by the philosophical concept of ‘ontology’, which can be defined (in simplistic terms) as the study of categories and relationships.
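To make the shape of a triple concrete, here is a minimal sketch using rdflib, a Python library for working with RDF. The http://example.org/ namespace is a placeholder rather than a commonly agreed identifier scheme, and rdfs:subClassOf is simply one standard way of saying ‘is a type of’.

```python
# A minimal sketch of the 'houses ARE A TYPE OF building' triple using rdflib.
# The example.org namespace is a placeholder, not a shared, agreed identifier scheme.
from rdflib import Graph, Namespace
from rdflib.namespace import RDFS

EX = Namespace("http://example.org/")

g = Graph()
# Subject: House, predicate: 'is a type of', object: Building.
# rdfs:subClassOf is the conventional predicate for 'is a kind of'.
g.add((EX.House, RDFS.subClassOf, EX.Building))

print(g.serialize(format="turtle"))
```

Serialised as Turtle, this reads as a single subject-predicate-object statement, which is all the structure a triple has.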
I think the best way to think about ontology in this context is as a subset of language. It makes sense that if you have a fragment of a language then you may only be able to express yourself in a fragmentary way. I need to say that I did the vast majority of the research for this essay in 2023 as part of my Computer Science BSc’s major project. It’s possible that the field has moved on by now, but at the time I felt that there was a distinct lack of consideration for what the boundaries of this particular subset of language might be. This really impacted the way my project went, because I was working from sources which used a lot of (digital) ink to explain how useful ontologies might be without spending time discussing what they might not be used for. For an extended period of time, it left me with the impression that if I was failing to express something in an ontology then it was a failing on my part to properly express myself or capture the true nature of the data I was working with. So when I say ‘ontologies are a lossy format,’ what I mean is ‘it may not be avoidable that an ontology can’t capture particular information without encoding a falsehood.’
It may be helpful to look at the exploration of this topic in China Miéville’s science fiction novel Embassytown. A civilisation in this book communicates in a language that could be understood as fully dereferenceable. Every word must refer to something that either exists or existed, an event that either happened or is currently happening. They can only communicate in terms of relationships of these things to one another. They cannot say that two things are the same because the semantic collapse would imply a physical one. Miéville posits that, for such a communication system to function, its speakers must be psychologically incapable of lying, or more accurately, that they can have no conception of truth. This language bears more than surface-level similarities to ontologies, so I suggest considering it as a thought experiment. In such a language, how might your communication be restricted?
So why now? I think that ontologies have become a lot more useful in the last three years. At least, they have become more used. They have applications in the area of large language models: generative AI can be used to generate ontologies and ontologies can be used as inputs to generative AI. I think that the interest stems from the idea that ontologies, by nature, capture facts in a concrete way that large language models fail to. In 2023, the big promise of ontologies was their projected capability of joining disparate data sources together as ‘linked open data’ in the ‘semantic web’. This never really came to fruition, but new developments have sparked new interest. I am not overly interested in litigating whether ontologies can be used to scaffold generative AI or whether generative AI can usefully generate ontologies. That is not exactly my area. It didn’t exist in the same form when I was doing the research. What I am interested in saying, speculatively, is that it is possible that the flaws of ontologies that I intend to describe and the flaws of large language models (e.g. conflation and ‘hallucination’) may overlap and interact in ways that aren’t beneficial. However, I have to leave working that one out as an exercise for the reader.
There are two things that really bother me about many writings about ontologies. One is a total focus on things that are possible or easy to express within the language of an ontology, to the detriment of things that are impossible or difficult. This could have been influenced by the tone of early writings about ontologies. Take a look at this web page from the year 2000 by Tim Berners-Lee, Dan Connolly, Lynn Andrea Stein, and Ralph Swick. The writers really emphasise the potential expressiveness of the format of ontologies. I think the authors of these documents believed it was more important to structure knowledge in a way that might be computationally tractable by future technologies than to worry about whether it was actually tractable at the time of writing:
The fundamental point at which the semantic web diverges from the KR [Knowledge Representation] systems of the past is that it puts to one side – for later – the problem of getting computers to answer questions about the data. The trick is to make the rules language as expressive as we need to really allow the web to express what is needed.
The other thing that bothers me is a proclivity towards referencing other areas of philosophy as if they provide evidence that whatever the author might want to express in an ontology is, in fact, expressible. We can refer to this paper by John Sowa, also from the year 2000, for an early example.
Sowa uses Peircean semiotics to describe how RDF triples can be used to form different types of expressions, including chained expressions. To provide a quick definition of Peircean semiotics, this refers to the study of meaning-making as explored by the philosopher Charles Sanders Peirce and subsequent philosophers who developed his ideas further. The idea of the ‘semiotic triangle’ utilised by Sowa and credited to Charles K. Ogden and I. A. Richards is originally derived from Peirce’s work. The semiotic triangle is used to deconstruct individual units of meaning or ‘signs’ into three constituent components: a ‘symbol’, which might be, for example, a written or spoken word; an ‘immediate object’, which the sign is seen to refer to in the mind of the person interpreting it; and a ‘dynamic object’, which, if it exists, exists in the real world. Peirce’s work is not a full and complete depiction of language, but it presents tools that help us break it down and think about it. This is not really how Sowa uses it, though. In fact, I would argue that Sowa’s presentation of the triangle as if there is a direct and unmediated connection between the symbol and the dynamic object betrays a characteristic of ontologies that Sowa doesn’t give full consideration to.
Whether an ontology can be interpreted rests on the ability to consistently find or ‘dereference’ the dynamic object. There cannot be a distinction between the immediate and dynamic objects. So it follows that the statements Sowa constructs are all either universal statements that apply to every instance of a thing or they are statements about named individuals and their activities. Statements like “Some cats have green eyes” or “All cats except Yojo are cute” (sorry, Yojo) or “A cat may have existed called Yojo” aren’t used as examples because they are fundamentally difficult or impossible to express in an ontology.
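To illustrate the contrast, here is a rough sketch in rdflib of what sits comfortably in triples and what does not. The example.org names and the eyeColour property are placeholders of my own invention, and the blank-node workaround shown at the end is only one of the standard contortions, not the only one.

```python
# Sketch of what RDF makes easy versus awkward. All example.org names are invented.
from rdflib import Graph, Namespace, BNode, Literal
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/")
g = Graph()

# Easy: a universal statement about a class...
g.add((EX.Cat, RDFS.subClassOf, EX.Animal))
# ...or facts about a named individual.
g.add((EX.Yojo, RDF.type, EX.Cat))
g.add((EX.Yojo, EX.eyeColour, Literal("green")))

# Awkward: 'some cats have green eyes'. The closest plain-RDF move is to
# invent an anonymous individual (a blank node), which quietly turns
# 'some cats' into 'this particular, unnamed cat'.
some_cat = BNode()
g.add((some_cat, RDF.type, EX.Cat))
g.add((some_cat, EX.eyeColour, Literal("green")))

# 'All cats except Yojo are cute' and 'a cat may have existed called Yojo'
# have no comparable encoding at all without stepping outside plain triples.
```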
The idea that anything would not be possible to express in an ontology is something that, in 2023, was still contentious. I will refer to a quote from Terhi Nurmikko-Fuller’s book, Linked Open Data for Digital Humanities. She writes:
In his opening statement to Tractatus Logico-Philosophicus, Wittgenstein (1922) quotes Kürnberger to say that “whatever a man knows … can be said in three words”. Similarly, one of the cornerstones of the Linked Data approach is that all knowledge can be captured in statements of three components. We call this the RDF (or Resource Description Framework) triple.
I think that we must recognise Nurmikko-Fuller’s reference to Wittgenstein as an appeal rather than, in a strict sense, an argument. Additionally, while Nurmikko-Fuller does, in a separate part of the book, introduce the nuance that everyone has biases and therefore what we present as facts in an ontology will be affected by our perspectives, she seems to take the stance that any perspective can be captured by "unambiguous relationships between unambiguous entities", so the problems of truth-telling lie with the ontology creators, and appear in the ontologies themselves only through that relationship. I really disagree with this. I think that it is important to consider author bias, but what about the biases that are fundamental to the data structure? In my opinion, those are more fundamental and at least as destructive.
What it means for the identity of a relationship or entity to be ambiguous is key to meaning-making in ontologies. It is also key to the creation of falsehoods that are propagated in RDF datasets. Fittingly, there are many ways for identity to be ambiguous, so even the concept of ambiguity is hard to pin down. However, what can generally be said is that if there is any ambiguity, or even uncertainty, about the truth value of a statement, then it is not even possible to talk around a topic within an ontology, let alone discuss it directly. This is the loss that ontologies require. If this communication is attempted regardless, the result is a kind of twisting of the truth: a lie encoded in data, which can then be referenced, calling into question the reliability of any reasoning that draws on it.
There’s an influential Umberto Eco quote on semiotics that seems relevant. Eco wrote in 1979, "If something cannot be used to tell a lie, conversely it cannot be used to tell the truth: it cannot in fact be used ‘to tell’ at all." With our external context, when interacting with ontologies, we may be able to spot incorrect or inaccurate information. But reasoning solely within an ontology, the framework assumes that everything is factual. It is possible to state in an ontology that you know that a statement is not true, or to define points along a gradient of certainty that can be referenced later. But even representing certainty as a continuum is fraught when dealing with potentially complex information. I think that it would be uncontroversial to say that ontologies can contain falsehoods due to error. However, the only untruths that can be represented as untruths are those that we can be, in some way, certain about. This is a very serious restriction. We don’t always know what it is that we don’t know.
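For a sense of what ‘defining points along a gradient of certainty’ can look like in practice, here is one possible sketch using RDF reification in rdflib. The ex:certainty property and the value 0.6 are placeholders I have invented for illustration; a real project would have to agree on or define such a vocabulary, and even then the number is only as meaningful as the judgement behind it.

```python
# Sketch of RDF reification: a statement about a statement, so that a degree
# of certainty can be attached to it. ex:certainty is an invented placeholder.
from rdflib import Graph, Namespace, Literal, BNode
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/")
g = Graph()

# The claim itself, held at arm's length rather than asserted directly:
# 'Yojo is a cat'.
claim = BNode()
g.add((claim, RDF.type, RDF.Statement))
g.add((claim, RDF.subject, EX.Yojo))
g.add((claim, RDF.predicate, RDF.type))
g.add((claim, RDF.object, EX.Cat))

# A statement about the claim: we judge it to be 60% certain.
g.add((claim, EX.certainty, Literal("0.6", datatype=XSD.decimal)))
```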
I want to give one example of something that cannot be discussed in an ontology but can be meaningfully discussed in natural language: the issue of historical persons who are not seen as important figures and are therefore less well-attested. The subject of my major project was Victorian-era mariners. What I found was that I was not able to meaningfully produce an ontology. I could say something along the lines of "There was a mariner who went by the name of David Jones on a vessel that was called The Iris. He said that he was 43 years old in 1880. This combination of details matches information from other shipping records, so he may have had a longer career, though we don’t know because he has a common name and we also don’t know what records we’re missing," and actually mean something, but I could not capture this in an ontology without making some claim to the unique identifiability of the mariner known as David Jones, as well as some level of certainty as to the veracity of the information contained in the record. I didn’t have either of these things. To produce the intended outcome of the project, I would effectively have had to lie.
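To show how the lie gets encoded, here is a small sketch with invented placeholder identifiers (the mariners namespace, statedAge, servedOn and so on are all mine, not from any real vocabulary). The problem is not the syntax but what minting and reusing an identifier silently asserts.

```python
# Sketch of the bind described above; every name here is an invented placeholder.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import XSD

EX = Namespace("http://example.org/mariners/")
g = Graph()

# Record one: the 1880 voyage of The Iris.
g.add((EX.davidJones, EX.name, Literal("David Jones")))
g.add((EX.davidJones, EX.statedAge, Literal(43, datatype=XSD.integer)))
g.add((EX.davidJones, EX.servedOn, EX.theIris))

# Record two: a later voyage that *may* concern the same person. Reusing the
# same URI asserts an identity I cannot evidence; minting a second URI asserts
# the opposite, that these are definitely different men. There is no triple
# for 'possibly the same person, depending on records we do not have'.
g.add((EX.davidJones, EX.servedOn, EX.anotherVessel))
```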
This is an invented example, but it is incredibly typical of this type of information, and it is something that is important to identify before setting out to produce an ontology. I didn’t, in part because I had based my work on two other groups of studies that had produced maritime ontologies from similar records in a similar time period, and I thought that by following similar methods I would get similar results. What I eventually found was that these studies had issues with their validation questions, the questions intended to demonstrate that the data could be queried in the expected ways. For example, the sets of validation questions did not include any that required identifying individual people or vessels, despite this being an integral part of the information the studies were trying to capture. There’s a violence to all this that reproduces the damage that has already been done in determining who and what will be remembered and how. It renders people who have already been made marginal in some way fully unintelligible, fully unspeakable. While this may not be an issue in all areas of application, I worry that an uncritical approach is more common than not, especially in fields where what is left unsaid is deeply important.
Given that the word "ontology" originates in philosophy, it seems particularly important to engage with philosophy for reasons other than seeking validation for why things ought to work. Returning to the works that have been used to justify building ontologies, and looking beyond them to works which might help to identify structural and methodological issues, can only help the field to develop. If we can’t articulate at least some of what we won’t be able to do, then it’s going to be difficult to know whether or not an ontology is an appropriate tool for a given task. Something I’ve observed is that discussions of ontology tend to start and end with highly simple examples. Can you describe a pizza? Can you describe a heteronormative nuclear family? It’s really, really difficult to start from these simple examples and move on to complex ones. I think it is important to start with the complex questions. Is there a data structure that can accommodate the information? How much do you have to twist it to fit into the constraints of an ontology? If you can’t answer these questions for particular aspects of your domain, you might not be able to consider those concepts as part of an ontology at all.
The general sentiment I have come to is that if nothing is laid out to demonstrate the edges of what is possible, then it is going to be incredibly difficult to know what is feasible. The rules of discourse within ontologies are restrictive. These restrictions can coerce the data we want to represent into a different form: changing its meaning, eliding things. This is a process that ought to be better understood.