“Semantically Enhancing Collections of Library and Non-Library Content”, an article by James E. Powell, Linn Marks Collins and Mark L. B. Martinez in D-Lib Magazine, Volume 16, Number 7/8, 2010.
I like the authors’ pragmatic attitude:
Although wholesale conversion of large metadata collections to semantic web data may not be a viable option yet, there’s a middle path which may open the door to more advanced user tools while at the same time increasing the relevance of digital libraries. It involves generation of semantically enhanced, focused collections of data.
Their discovery application is an example of data fusion, i.e. the merging of data from various sources by mapping between formats. It integrates digital library content with external data to augment the bibliographic metadata and to create an information structure that goes beyond mere bibliography.
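Purely as an illustration (this is not the authors’ actual pipeline, and all field names are made up), the core of such data fusion can be sketched as mapping records from each source onto a common set of properties and merging them on a shared key:

```python
# Hypothetical sketch of metadata fusion: map records from two sources
# onto a common vocabulary and merge them on a shared identifier (a DOI).

def from_dublin_core(record):
    """Map a Dublin-Core-like library record to a common internal schema."""
    return {
        "id": record["dc:identifier"],
        "title": record["dc:title"],
        "creators": record.get("dc:creator", []),
    }

def from_external_source(record):
    """Map an external (non-library) record onto the same schema."""
    return {
        "id": record["doi"],
        "subjects": record.get("topics", []),
        "cited_by": record.get("citation_count"),
    }

def fuse(*records):
    """Merge mapped records that share the same identifier."""
    fused = {}
    for rec in records:
        fused.setdefault(rec["id"], {}).update(rec)
    return fused

library = from_dublin_core({
    "dc:identifier": "10.1000/example",
    "dc:title": "Semantic Enhancement",
    "dc:creator": ["Powell, J."],
})
external = from_external_source({
    "doi": "10.1000/example",
    "topics": ["semantic web", "digital libraries"],
    "citation_count": 12,
})
print(fuse(library, external))
```

The result is a record that is richer than either source on its own – which is the whole point of “going beyond mere bibliography”.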
Another aspect that caught my eye is their visualization approach. Representing metadata in a graph model might help users navigate and encounter new connections, an option that could also benefit FRBRized representations of bibliographic data (it reminds me of Ron Murray’s FRBR network models of complex bibliographic relationships).
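A toy sketch of what navigating such a graph model could look like – invented records, with the networkx library standing in for whatever the authors actually use:

```python
import networkx as nx

# Toy bibliographic graph: works, authors and subjects as nodes,
# relationships ("created_by", "about") as typed edges.
G = nx.Graph()
G.add_edge("Work: Moby-Dick", "Author: Herman Melville", relation="created_by")
G.add_edge("Work: Moby-Dick", "Subject: Whaling", relation="about")
G.add_edge("Work: In the Heart of the Sea", "Subject: Whaling", relation="about")

# "Navigation" is just walking the edges: from one work, via a shared
# subject, a user can stumble upon related works they never searched for.
for subject in G.neighbors("Work: Moby-Dick"):
    if subject.startswith("Subject:"):
        related = [n for n in G.neighbors(subject) if n != "Work: Moby-Dick"]
        print(subject, "->", related)
```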
After reading some well-informed blog posts about the Google Books Ngram viewer, I’ll throw in my 2 cents too.
The question with visualizations is: what do they actually tell us, and in what way are they helpful? They are well suited to providing an overview of large amounts of data, and their functionality will surely be enhanced in the future. However, we should always be aware of the imperfections arising from OCR or metadata glitches, and of the limits of such visualizations.
By way of example, the Ngram viewer doesn’t account for changes in the semantics of words (or for disambiguation, for that matter). Terms are likely to have different meanings at different points in time. Nor does it tell you anything about the cultural context, which may be crucial for understanding the distribution of highs and lows in the graph. Counting occurrences is not enough to capture semantic drift or contextual information.
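Here’s a toy illustration of the point (invented snippets, not Google’s data): a raw n-gram count treats every occurrence of a word as the same token, so a shift in sense leaves no trace in the numbers.

```python
from collections import Counter

# Made-up snippets standing in for books from two periods; not real data.
corpus = {
    1850: ["the awful majesty of the mountains", "an awful and sublime sight"],
    1950: ["the weather was awful today", "an awful mistake"],
}

# An n-gram count treats every occurrence of "awful" as the same token...
counts = {year: Counter(" ".join(texts).split())["awful"]
          for year, texts in corpus.items()}
print(counts)  # {1850: 2, 1950: 2} -- identical counts

# ...even though the surrounding words suggest the sense has shifted
# from "awe-inspiring" to "bad". The counts alone can't show that.
```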
All in all, here’s another example of “computers only process symbols” – it’s up to people to create meaning from the results. On the other hand, we wouldn’t be able to explore the abundance of data without such analytical tools.
Patrick Durusau suggests that properties help identify a subject representative in topic maps. For example, he writes:
… All naming conventions are subject to ambiguity and “expanded” naming conventions, such as a list of properties in a topic map, may make the ambiguity a bit more manageable.
That depends on a presumption that if more information is added and a user advised of it, the risk of ambiguity will be reduced. …
In another post, he writes:
… Topic maps provide the ability to offer more complete descriptions of subjects in hopes of being understood by others.
With the ability to add descriptions of subjects from others, offering users a variety of descriptions of the same subject. …
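To make the idea a bit more concrete, here is a minimal, purely illustrative sketch (plain Python rather than any topic map syntax) of a “list of properties” acting as an expanded naming convention: two subject proxies that share a name are kept apart by their other properties.

```python
# Purely illustrative: two subject proxies that share a name but are
# distinguished (or merged) by comparing their other properties.
paris_fr = {"name": "Paris", "type": "city", "country": "France",
            "population": 2_100_000}
paris_tx = {"name": "Paris", "type": "city", "country": "USA",
            "state": "Texas"}

def same_subject(a, b, keys=("name", "type", "country")):
    """Naive identity test: do the proxies agree on all shared keys?"""
    return all(a.get(k) == b.get(k) for k in keys if k in a and k in b)

print(same_subject(paris_fr, paris_tx))  # False: the extra properties
# keep the two subjects apart where the name alone would be ambiguous.
```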
I have some doubts about whether this description approach, as opposed to the naming approach (subject indicator and subject locator) now in place in topic maps, will work. IMO, adding more descriptions (by different people with different models of the world, opinions, values, perceptions…) increases the risk of introducing divergent viewpoints and thus runs counter to the original intent (to determine identity and reduce ambiguity). Adding more information is likely to make description and reference more ambiguous, as Pat Hayes and Harry Halpin have argued in their paper “In Defense of Ambiguity”. Common sense might tell you that the more you define something and pin it down, the more precise it becomes. But precision is not necessarily achieved by adding descriptions – rather, subject identification becomes more prone to ambiguity.
Some more questions about identity and disambiguation:
- How do you decide that a description has added a dimension to a concept that makes that concept so different that it becomes “another” concept? Who draws the line (someone else might disagree)? Can we formulate criteria for this process, or is it arbitrary?
- How can we handle time- and context-sensitive descriptions – properties that change over time or according to context? Does a property that changes still define identity, and does identity then change too? Maybe there are “minor” and “major” properties – if a “major” property changes, does that entail a change of identity, whereas identity stays the same when a “minor”, less typical property changes? How can either of these “property classes” be determined?
I wonder whether we fall back into the Aristotelian system by stressing description, attributes and properties. Or perhaps description and naming strategies in subject identification can coexist and complement each other.
To ensure semantic interoperability, in an ideal world there would be a shared understanding of what concepts and data elements mean. Mapping between terms in different ontologies or between data elements in different formats is fine as far as it goes, but there are deeper issues: people struggle to represent meaning in computer systems built for others whose model of the world might not be (exactly) the same.
A striking example of the difficulty of semantic interoperability is a Linked Data challenge that sought to answer the question: “Which town or city in the UK has the highest proportion of students?”. One answer puts Cambridge first (you’ll notice the quite obvious mistakes in the data), while another sees Milton Keynes on top. Without digging too deep into the details, one can see that it’s important to make sure the definitions of “town”, “city” and “student” are the same in all data sources (Wikipedia, government data…), and to formulate a precise enough query.
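A toy sketch of the underlying problem (the figures and field names are invented, not the real data from the challenge): the very same “proportion of students” calculation crowns different winners depending on which definition of “student” each source encodes.

```python
# Made-up figures, purely to show how the definition of "student"
# shifts the ranking; these are NOT the real numbers from the challenge.
places = {
    "Cambridge":     {"population": 124_000, "students_narrow": 39_000,
                      "students_broad": 62_000},
    "Milton Keynes": {"population": 230_000, "students_narrow": 9_000,
                      "students_broad": 120_000},
}

def ranking(definition):
    """Rank places by proportion of 'students' under a given definition."""
    return sorted(places,
                  key=lambda p: places[p][definition] / places[p]["population"],
                  reverse=True)

print(ranking("students_narrow"))  # Cambridge comes out on top
print(ranking("students_broad"))   # Milton Keynes comes out on top
```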
The nuances of meaning make a huge difference here. A casual user is unlikely to get the semantics exactly right to match these nuances. Can we design systems that cope with these intricacies – systems that can dynamically incorporate context-sensitive and domain-specific semantics, semantic change over time, and locally negotiated semantics as opposed to universal approaches?
While syntactic interoperability is a prerequisite for effective data interchange, a whole host of other factors – such as the source of the data and the target group and use cases for which it was produced – have to be taken into account when re-using pieces of data. Context influences the semantics of data elements and of their content (a certain concept can have different meanings in different knowledge communities). I’d call this level of interoperability “pragmatic”, drawing on a definition of pragmatics as a subfield of linguistics: it “studies the ways in which context contributes to meaning.”
How are data and data elements actually used in context, and used differently in different contexts? Identifiers and identity conditions may vary depending on context. What about incorrect or inconsistent use of data elements? And what is/are the underlying model(s) that enable applications to share data and interpret them correctly in the given domain? If, for example, we don’t agree on the concept “book” and its properties, how can we effectively share and make sense of data about it? Is data still as valuable when torn out of its original context (Linked Data)?
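To take the “book” example literally for a moment (a hypothetical illustration): if one source counts distinct works and another counts physical copies, naively combining the two “book” elements yields a number that answers no meaningful question.

```python
# Hypothetical example: two communities both publish a "book" count,
# but one counts distinct works and the other counts physical copies.
library_a = {"book": 3}    # 3 distinct works (a FRBR-ish notion of "book")
library_b = {"book": 250}  # 250 physical copies on the shelves

# Treating the element "book" as the same concept and summing the values
# produces a figure that is neither a count of works nor of copies.
total = library_a["book"] + library_b["book"]
print("Combined 'books':", total)  # 253 -- works plus copies, apples plus oranges
```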
I suggest that data pragmatics could look at the way people use formats, vocabularies, etc. in practice, as opposed to taking a theoretical, top-down, prescriptive view. With the number of people creating and publishing data on the web growing at an unprecedented rate, in this “open world” we are bound to end up with data of varying quality that may or may not be interoperable on all levels.