To ensure semantic interoperability, in an ideal world there would be a shared understanding of what concepts and data elements mean. Mapping between terms in different ontologies or between data elements in different formats helps, but there are deeper issues: people struggle to represent meaning in computer systems built for others whose model of the world might not be (exactly) the same.
A striking example of the difficulty of semantic interoperability is a Linked Data challenge that sought to answer the question: “Which town or city in the UK has the highest proportion of students?”. One answer puts Cambridge first (you’ll notice the quite obvious mistakes in the data), while another sees Milton Keynes on top. Without digging too deep into the details, one can see that it’s important to make sure the definitions of “town”, “city” and “student” are the same in all data sources (Wikipedia, government data…), and to formulate a precise enough query.
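The effect of the definition on the result can be sketched in a few lines of Python. The figures below are made up for illustration and are not the actual challenge data; `highest_proportion` and both student counts are my own hypothetical constructs:

```python
# Hypothetical population and student counts -- illustrative only,
# NOT the real figures from the Linked Data challenge.
towns = {
    "Cambridge":     {"population": 124_000, "university_students": 38_000, "all_students": 55_000},
    "Milton Keynes": {"population": 230_000, "university_students": 24_000, "all_students": 110_000},
}

def highest_proportion(towns, student_key):
    """Rank towns by students / population under a given definition of 'student'."""
    return max(towns, key=lambda t: towns[t][student_key] / towns[t]["population"])

# Counting only university students puts one town first...
print(highest_proportion(towns, "university_students"))  # Cambridge
# ...while counting every enrolled learner puts another on top.
print(highest_proportion(towns, "all_students"))         # Milton Keynes
```

The query logic is identical in both calls; only the meaning attached to “student” changes, and with it the answer.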
The nuances of meaning make a huge difference here, and a casual user is unlikely to get the semantics exactly right to match them. Can we design systems that cope with these intricacies: systems that dynamically incorporate context-sensitive and domain-specific semantics, semantic change over time, and locally negotiated semantics rather than universal approaches?
While syntactic interoperability is a prerequisite for effective data interchange, a whole host of other factors, such as the source of the data and the target group and use cases for which it was produced, have to be taken into account when re-using pieces of data. Context influences the semantics of data elements and of their content: a given concept can have different meanings in different knowledge communities. I’d call this level of interoperability “pragmatic”, drawing on a definition of pragmatics as the subfield of linguistics that “studies the ways in which context contributes to meaning.”
How are data and data elements actually used in context, and used differently in different contexts? Identifiers and identity conditions may vary depending on context. What about incorrect or inconsistent use of data elements? And what is/are the underlying model(s) that enable applications to share data and interpret them correctly in the given domain? If, for example, we don’t agree on the concept “book” and its properties, how can we effectively share and make sense of data about it? Is data still as valuable when torn out of its original context (Linked Data)?
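The “book” question above can be made concrete with a deliberately simplified sketch. Both sources and the merge logic are hypothetical; the point is that one source models a book as an abstract work while the other models it as a concrete edition:

```python
# Two hypothetical data sources that both claim to describe "books".
source_a = [  # source A: one record per *work*
    {"title": "Moby-Dick", "author": "Herman Melville"},
]
source_b = [  # source B: one record per *edition*
    {"title": "Moby-Dick", "isbn": "978-0-14-243724-7", "pages": 720},
    {"title": "Moby-Dick", "isbn": "978-0-19-953571-5", "pages": 654},
]

# A naive merge on title silently mixes the two models of "book":
merged = [work | edition for work in source_a
          for edition in source_b if work["title"] == edition["title"]]
print(len(merged))  # 2
```

Is that two books or one? The number is correct under source B’s model and wrong under source A’s, and nothing in the merged data tells you which concept of “book” you ended up with.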
I suggest that data pragmatics could look at the way people actually use formats, vocabularies etc., as opposed to taking a theoretical, top-down, prescriptive view. With the number of people creating and publishing data on the web growing at an unprecedented rate, in this “open world”, we are bound to end up with data of varying quality that may or may not be interoperable on all levels.
Lately, some issues and questions have been nagging at me; I’ll explore them in future posts.
With linked and open data proliferating in all kinds of domains, interoperability becomes critical. What is syntactic interoperability? In short: if two systems use the same data formats and protocols, they can exchange data without additional effort. Using standards like MARC (encoded in XML) or Z39.50 (to take examples from the library world) certainly helps distributed systems interoperate. But even then, the data very often doesn’t work “as is”. Consider, for example, the different “dialects” of MARC. Even though library data is based on the same structural standard, MARC, and on the same cataloguing rules, AACR2, there are rule interpretations and subtle differences in the use of certain fields from library to library or network to network. More generally, people bring their own views, decisions and internal models of the world to the creation and later use of data (and even to reading the documentation of formats or standards).
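Such “dialects” can be sketched with a toy example. The two records below are hypothetical, reduced to a single field, and stand in for local conventions one actually meets in catalogue data, such as one library recording the publication date in 260$c with copyright marker and trailing punctuation while another records the bare year:

```python
import re

# Two hypothetical catalogue records, both "MARC", but with local conventions:
record_a = {"260": {"c": "c1998."}}  # copyright marker and ISBD punctuation
record_b = {"260": {"c": "1998"}}    # plain year

def publication_year(record):
    """Extract a four-digit year from 260$c, tolerating local punctuation."""
    raw = record.get("260", {}).get("c", "")
    match = re.search(r"\d{4}", raw)
    return int(match.group()) if match else None

# Byte for byte the fields differ, but after normalization they agree:
print(publication_year(record_a) == publication_year(record_b))  # True
```

The shared format guarantees where the date lives, not how it is written, so syntactic agreement alone doesn’t make the records directly comparable.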
Interoperability doesn’t stop at the syntactic level. Adopting syntactic standards is only the basis of the bigger picture. If all data providers used RDF and URIs, would that make all data easily understandable (for machines and humans) and interoperable? Hardly. As William Kent showed in his paper “The many forms of a single fact” (1988), “severe problems of heterogeneity can exist between databases containing the same information in the same type of data structure under the same database processor.” For the moment, I won’t talk about more abstract levels of interoperability – I’ll save that for the next couple of posts…
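Kent’s point can be paraphrased with a made-up example (mine, not taken verbatim from his paper): one fact, “Anna speaks French”, stored in three structurally different ways within the same kind of system:

```python
# One fact -- "Anna speaks French" -- in three shapes, all plain Python data:
db_boolean  = {"Anna": {"speaks_french": True}}              # fact as a boolean attribute
db_list     = {"Anna": {"languages": ["French", "German"]}}  # fact as a multi-valued field
db_relation = [("Anna", "French"), ("Anna", "German")]       # fact as a row in a relation

# Each shape demands different query logic for the very same question:
q1 = db_boolean["Anna"]["speaks_french"]
q2 = "French" in db_list["Anna"]["languages"]
q3 = ("Anna", "French") in db_relation
print(q1, q2, q3)  # True True True
```

All three stores contain “the same information in the same type of data structure”, yet no query written against one shape works against another, which is exactly the heterogeneity Kent describes.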