Category Archives: diversity

Ambiguity and identity

Patrick Durusau suggests that properties help identify a subject representative in topic maps. For example, he says that

… All naming conventions are subject to ambiguity and “expanded” naming conventions, such as a list of properties in a topic map, may make the ambiguity a bit more manageable.

That depends on a presumption that if more information is added and a user advised of it, the risk of ambiguity will be reduced. …

In another post, he writes:

… Topic maps provide the ability to offer more complete descriptions of subjects in hopes of being understood by others.

With the ability to add descriptions of subjects from others, offering users a variety of descriptions of the same subject. …

I have some doubts about whether this description approach, as opposed to the naming approach (subject indicator and subject locator) now in place in topic maps, will work. IMO, adding more descriptions (by different people with different models of the world, opinions, values, perceptions …) increases the risk of introducing divergent viewpoints and thus runs counter to the original intent (to determine identity and reduce ambiguity). Adding more information is likely to make description and reference more ambiguous, as Pat Hayes and Harry Halpin have argued in their paper “In defense of ambiguity”. Common sense might tell you that the more you define something and pin it down, the more precise it becomes. But more descriptions do not necessarily bring more precision; rather, subject identification becomes more prone to ambiguity.
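To make the contrast concrete, here is a toy sketch in Python. It is not the API of any topic map engine: the dictionary layout, the function names, the example.org identifier and the property values are all invented for illustration, and comparing property values by strict equality is a simplifying assumption.

```python
def shares_identifier(a, b):
    """Naming approach: same subject if the topics share an identifier."""
    return bool(a["identifiers"] & b["identifiers"])


def conflicts(a, b):
    """Description approach: properties both topics state, but with different values."""
    pa, pb = a["properties"], b["properties"]
    return sorted(key for key in pa.keys() & pb.keys() if pa[key] != pb[key])


# Two hypothetical topics, written by different people about the same subject.
topic_a = {
    "identifiers": {"http://example.org/id/book-1"},
    "properties": {"kind": "book", "period": "19th century"},
}
topic_b = {
    "identifiers": {"http://example.org/id/book-1"},
    "properties": {"kind": "book"},
}

before = conflicts(topic_a, topic_b)

# The second author adds more description from their own point of view.
topic_b["properties"].update({"period": "Victorian era", "audience": "general"})

after = conflicts(topic_a, topic_b)
same = shares_identifier(topic_a, topic_b)

print(before, after, same)
```

In this sketch the shared identifier keeps the two topics together, while the added description introduces a disagreement on "period", even though the two values may well be compatible. Every further property is another place where two describers can diverge.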

Some more questions about identity and disambiguation:

  • How do you decide that a description has added a dimension to a concept that makes that concept so different that it becomes “another” concept? Who draws the line (someone else might disagree)? Can we formulate criteria for this process, or is it arbitrary?
  • How can we handle time- and context-sensitive descriptions – properties that change over time or according to context? Does a property that changes still define identity, and if so, does identity change with it? Maybe there are “minor” and “major” properties: does a change in a “major” property entail a change of identity, whereas identity stays the same when a “minor”, less typical property changes? How can either of these “property classes” be determined?

I wonder whether we fall back into the Aristotelian system by stressing description, attributes and properties. Or perhaps description and naming strategies in subject identification can coexist and complement each other.

Interoperability – pragmatic

While syntactic interoperability is a prerequisite for effective data interchange, a whole host of other factors such as the source of the data, the target group and use cases for which it was produced have to be taken into account when re-using pieces of data. Context influences the semantics of data elements and of their content (a certain concept can have different meanings in different knowledge communities). I’d call this level of interoperability “pragmatic”, drawing on a definition of pragmatics as a subfield of linguistics: it “studies the ways in which context contributes to meaning.”

How are data and data elements actually used in context, and used differently in different contexts? Identifiers and identity conditions may vary depending on context. What about incorrect or inconsistent use of data elements? And what is/are the underlying model(s) that enable applications to share data and interpret them correctly in the given domain? If, for example, we don’t agree on the concept “book” and its properties, how can we effectively share and make sense of data about it? Is data still as valuable when torn out of its original context (Linked Data)?

I suggest that data pragmatics could look at the way people use formats, vocabularies etc. in practice, as opposed to taking a theoretical, top-down, prescriptive view. In this “open world”, where the number of people creating and publishing data on the web is growing at an unprecedented rate, we are bound to end up with data of varying quality that may or may not be interoperable on all levels.

Crowd-sourcing then and now

The Smithsonian Institution apparently has a long history of crowd-sourcing. David Alan Grier reports in his podcast “The Confident and the Curious” that in the 1850s, the original weather observers collected data for the US Navy. Four times a day, the volunteers sent the data they had gathered with scientific instruments to the Central Weather Office, located in the Smithsonian Institution in Washington D.C. Even today, the Smithsonian makes use of crowd-sourcing to make its vast collections more accessible. Through Flickr, a research fellow at the National Zoo gets help from the public in cataloging photographs from wildlife locations.

With the rise of user participation on the web, traditional institutions can no longer claim to have an authoritative view on any given subject. Projects like Linux, Firefox, Wikipedia, OpenLibrary or LibraryThing and professions like journalism testify to this fundamental change. Incidentally, OpenLibrary is thinking about putting out a call for volunteers to help correct bad OCR by transcribing old handwriting.

So what is at the core of crowd-sourcing? People have to be willing to share what they know for a project they perceive as furthering the common good. The mixture of points of view and experience provides a more diverse outlook on the project or topic at hand. Crowd-sourcing entails relinquishing a bit of control, which might be a big step (both psychologically and politically) for some institutions. Could crowd-sourcing be applied to library cataloging too? Libraries could ask experts in certain fields to help catalog specific collections that have not been tackled for various reasons. This is but one example of how libraries could open themselves to the “wisdom of crowds”.

Identifiers, FRBR and diversity

ELAG 2010 featured a “Workshop on FRBR and Identifiers”. The presentation gives an overview of which identifiers exist for various forms of resources, with special emphasis on FRBR entities, and includes a brief look at the role of identifiers in linked data. Note that I won’t talk about URL identifiers for FRBR entities and relationships here; that is a vast topic in and of itself.

Library-created control numbers identify the metadata about a resource, not the resource itself (as ISBNs do). For one resource, different institutions (publishers, booksellers, libraries) create different identifiers – but how reliable and consistent are they? One ISBN doesn’t necessarily stand for one book only, which undermines uniqueness in many cases. As WorldCat data shows (assuming that catalogers correctly recorded the details available), we have a large number of books without ISBNs (which only came into widespread use in the 1970s). Generally, a considerable percentage of resources is not identified in a standard way. So the picture is not uniform at all, and some of the established identifiers will have to be reconsidered: the ISBN system is likely to reach its limits with the proliferation of e-books, and maybe the library world will someday stop thinking in terms of “records” (possibly with metadata being assembled just in time instead of just in case) – will the LCCN be obsolete then?

There are many efforts to create and maintain identifiers in different domains. Libraries around the world maintain separate authority files (albeit tied together in VIAF) and create separate “records”, and thus separate identifiers, for the same resource. It’s important for identifiers to be reused outside their specific areas. Library identifiers have lingered in silos for a long time and are only slowly being adopted by “outside” communities (e.g. the German Wikipedia links identifiers from the German National Library’s name authority file to the articles about the respective persons).

A given FRBR work usually has various manifestations, which in turn have several identifiers (leaving out the expression level for the moment); identifiers at this level, such as ISBN and LCCN, are the most commonly used. OpenLibrary, for one, collocates manifestation identifiers. Topic maps could integrate information from heterogeneous sources on the basis of identifiers. We can probably never achieve global agreement on one unique bibliographic identifier, nor do we have to if we have systems that enable us to consolidate the diversity of identifiers.
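What “consolidating the diversity of identifiers” could look like in practice can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how OpenLibrary or any topic map engine actually works: the record layout, the source names and the identifier values (ISBN-A, LCCN-X, OCLC-1 …) are invented placeholders, and the rule “two records belong together if they share at least one identifier” is a deliberate simplification.

```python
from collections import defaultdict


def consolidate(records):
    """Group records that share at least one (scheme, value) identifier."""
    parent = list(range(len(records)))  # union-find forest over record indexes

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)  # keep the earliest record as root

    first_seen = {}  # identifier -> index of the first record that carried it
    for idx, rec in enumerate(records):
        for ident in rec["identifiers"]:
            if ident in first_seen:
                union(idx, first_seen[ident])
            else:
                first_seen[ident] = idx

    members_by_root = defaultdict(list)
    for idx in range(len(records)):
        members_by_root[find(idx)].append(idx)

    return [
        {
            "sources": [records[i]["source"] for i in members],
            "identifiers": sorted(
                {ident for i in members for ident in records[i]["identifiers"]}
            ),
        }
        for members in members_by_root.values()
    ]


# Hypothetical records for illustration; none of these identifiers are real.
records = [
    {"source": "publisher", "identifiers": {("isbn", "ISBN-A")}},
    {"source": "library", "identifiers": {("isbn", "ISBN-A"), ("lccn", "LCCN-X")}},
    {"source": "bookseller", "identifiers": {("isbn", "ISBN-B")}},
    {"source": "union catalog", "identifiers": {("lccn", "LCCN-X"), ("oclc", "OCLC-1")}},
]

clusters = consolidate(records)
for cluster in clusters:
    print(cluster["sources"], cluster["identifiers"])
```

Note that the publisher record and the union catalog record share no identifier directly; they only end up together because the library record bridges them. That transitivity is convenient, but it also means that a single unreliable identifier (say, an ISBN that was assigned to more than one book) would wrongly merge everything attached to it.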