Monthly Archives: August 2010

The Recovery Act and topic maps

“Improving Federal Spending Transparency: Lessons Drawn from” by Raymond Yee, Eric C. Kansa, and Erik Wilde of the UC Berkeley School of Information “explores the effectiveness of accountability measures deployed for the American Recovery and Reinvestment Act of 2009 (‘Recovery Act’ or ‘ARRA’).” Although data has been released as part of Open Government initiatives, the authors point out a lack of transparency caused by, among other things, data silos, highly distributed information sources, and a lack of controlled access points.

According to the authors, ARRA data resembles a jigsaw puzzle: the legislation is complex, and there are many players and sources of data. In my view, topic maps could help with a number of problems cited in the paper. They could build a bridge between several budgetary disclosure systems, expose the structure behind ARRA, and make explicit the relationships between the legislation and the intent of Congress, its implementation by the Treasury Department, the allocation of money to different accounts, and spending patterns (including agencies and recipients). Links could go back and forth, connecting data from across agencies (e.g. spending data –> program documentation –> legislation authorizing funding for that program). This obviously requires machine-processable, unambiguous identifiers and controlled vocabularies for the various entities – which seems to be a weakness in the data so far.
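To make the idea concrete, here is a minimal sketch in plain Python of how a topic map might tie a spending item back to its authorizing legislation. All names and identifiers here are hypothetical, and a real topic map engine would of course offer much more (merging, scope, occurrences), but the core pattern is just typed topics linked by typed associations:

```python
# Hypothetical topics, each with a stable identifier and a type.
topics = {
    "arra-2009": {"name": "American Recovery and Reinvestment Act of 2009",
                  "type": "legislation"},
    "program-x": {"name": "Example Grant Program",
                  "type": "program"},
    "award-123": {"name": "Award 123 to Recipient Y",
                  "type": "spending-item"},
}

# Typed associations with named roles, linking the topics above.
associations = [
    {"type": "authorizes",
     "roles": {"legislation": "arra-2009", "program": "program-x"}},
    {"type": "funded-under",
     "roles": {"program": "program-x", "spending": "award-123"}},
]

def trace_to_legislation(spending_id):
    """Follow associations from a spending item back to its legislation."""
    program = next(a["roles"]["program"] for a in associations
                   if a["type"] == "funded-under"
                   and a["roles"]["spending"] == spending_id)
    return next(a["roles"]["legislation"] for a in associations
                if a["type"] == "authorizes"
                and a["roles"]["program"] == program)
```

With data like this, `trace_to_legislation("award-123")` walks from the award via the program to the Act – exactly the kind of cross-agency link the paper asks for.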

The authors also call for an account of the data sources, which can be “first-class citizens” in topic maps, i.e. topics in their own right that can be talked about. Moreover, they stress the importance of efficient information retrieval systems – if you can’t find the information, what use is access to data? Budgetary metadata of high quality is critical to findability and useful display.

Classification would also be conducive to discovery, keeping in mind that “… classification is not necessarily an objective process. It is shaped by the assumptions and goals of people and organizations. These worldviews and goals often see disagreement and evolve over time.” Topic maps have mechanisms to reflect changes in terminology without discarding older terms, and different views of the world can coexist and be indicated by scope.
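The scope mechanism can be sketched in a few lines of Python. The example uses the (real) LCSH change from “Cookery” to “Cooking”; the topic and scope identifiers are invented for illustration:

```python
# Scoped names: each name carries the set of contexts in which it is valid,
# so older terminology can coexist with current terms instead of being discarded.
names = [
    {"topic": "t-cooking", "value": "Cookery", "scope": {"pre-2010-LCSH"}},
    {"topic": "t-cooking", "value": "Cooking", "scope": {"current-LCSH"}},
]

def names_in_scope(topic_id, context):
    """Return the names of a topic that are valid in the given context."""
    return [n["value"] for n in names
            if n["topic"] == topic_id and context in n["scope"]]
```

A catalog could then display `names_in_scope("t-cooking", "current-LCSH")` to today’s users while still matching searches against the older scoped name.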

Access to data doesn’t automatically imply transparency and findability. The growing number of Open Government efforts (so far primarily in the U.K. and the U.S.) looks like a great opportunity for topic maps.


Identifiers, FRBR and diversity

ELAG 2010 featured a “Workshop on FRBR and Identifiers”. The presentation gives an overview of the identifiers that exist for various forms of resources, with special emphasis on FRBR entities, including a brief look at the role of identifiers in linked data. I won’t talk about URL identifiers for FRBR entities and relationships here – that is a vast topic in and of itself.

Library-created control numbers identify the metadata about a resource, not the resource itself (as identifiers like the ISBN do). For one resource, different institutions (publishers, booksellers, libraries) create different identifiers – but how reliable and consistent are they? One ISBN doesn’t necessarily stand for one book only, which undermines uniqueness in many cases. As WorldCat data shows (assuming that catalogers correctly recorded the details available), a large number of books have no ISBN at all (ISBNs only came into widespread use in the 1970s), and in general a considerable percentage of resources is not identified in any standard way. So the picture is not uniform at all, and some of the established identifiers will have to be reconsidered: the ISBN system is likely to reach its limits with the proliferation of e-books, and the library world may at some point stop thinking in terms of “records” (with metadata being assembled just in time instead of just in case) – will the LCCN be obsolete then?
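One thing an ISBN *does* guarantee is internal consistency: its final digit is a checksum, so transcription errors can be caught even though reuse of one ISBN for several books cannot. A small validator for the ISBN-13 check digit (weights alternating 1 and 3, sum divisible by 10):

```python
def isbn13_is_valid(isbn):
    """Validate an ISBN-13 check digit.

    Digits are weighted alternately 1 and 3; the weighted sum of all
    thirteen digits must be divisible by 10. Hyphens are ignored.
    """
    digits = [int(c) for c in isbn if c.isdigit()]
    if len(digits) != 13:
        return False
    total = sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0
```

For example, `isbn13_is_valid("978-0-306-40615-7")` returns `True`, while changing any single digit makes it fail – a well-formed ISBN is still no guarantee that it denotes exactly one book.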

There are many efforts to create and maintain identifiers in different domains. Libraries around the world maintain separate authority files (albeit tied together in VIAF) and create separate “records”, and thus identifiers, for the same resource. It’s important for identifiers to be reused outside their specific areas. Library identifiers have lingered in silos for a long time and are only slowly being adopted by “outside” communities (e.g. the German Wikipedia has linked identifiers from the National Library’s name authority file to the articles about the respective persons).

A given FRBR work usually has various manifestations, which in turn have several identifiers (leaving out the expression level for the moment) – and these manifestation identifiers (ISBN, LCCN…) are the most commonly used. OpenLibrary, for one, collocates manifestation identifiers. Topic maps could integrate information from heterogeneous sources on the basis of such identifiers. We can probably never achieve global agreement on one unique bibliographic identifier – nor do we have to, if we have systems that let us consolidate the diversity of identifiers.
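Such consolidation needs no global identifier, only a structure that collects identifier/type pairs under a work. A minimal sketch (the work key and all identifier values below are placeholders, not real catalog data):

```python
from collections import defaultdict

# Map each work to the set of (identifier type, value) pairs known for
# its manifestations, regardless of which institution minted them.
work_index = defaultdict(set)

def register(work_key, id_type, value):
    """Record one manifestation identifier under a work (duplicates merge)."""
    work_index[work_key].add((id_type, value))

# Hypothetical identifiers for manifestations of one work:
register("example-work", "ISBN", "978-0-00-000000-2")
register("example-work", "LCCN", "2001000000")
register("example-work", "OLID", "OL0000000W")
register("example-work", "ISBN", "978-0-00-000000-2")  # re-reported, merges
```

Because the values live in a set, the same identifier reported by two sources collapses into one entry, while genuinely different identifiers for the same work sit side by side.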

Open access interdisciplinary textbook

Robert B. Allen, College of Information Science and Technology, Drexel University, has made a draft textbook entitled Information: A Fundamental Construct available open access. It provides “a broad overview of informatics, information science, and information systems”, treating topics like knowledge representation, human cognition, natural language, and text and human language technologies, to name but a few. A good way to brush up on what we already know or to learn something new!


Source-sensitive facet?

To follow up on the last post and Ranganathan’s law “Save the time of the reader”: one way of saving the reader’s time is to refine faceted search and browsing. The bigger the indexed corpus (which is what happens when abstracts, TOCs, indexes etc. are included), the more hits a query will yield, and the higher the potential number of irrelevant results. It also becomes harder for users to see why a certain result was returned when the search term doesn’t show up in readily identifiable fields like title, author, or subject heading. We don’t save users much time if they have to wade through pages of search results just because we don’t want them to miss a potentially useful book that matched only in its back-of-the-book index.

The discovery layers that are increasingly replacing traditional library OPACs offer faceting of results by various criteria (language, format, year of publication etc.). How about introducing what I would call a “source-sensitive facet”? This facet would show where the search term occurs: in the metadata, the subject heading(s), or supplementary material such as TOCs, abstracts, or indexes. Scottsdale Public Library has such a facet in place.

Their “Search found in” facet is apparently generated from the metadata proper, but you could also imagine indicating the location (e.g. an electronic document associated with a given item) where the search term was found: “Search found in TOC / abstract / index”. The document in question has to be typed for the system to recognize which kind of material (TOC, abstract, index) it is and to display this information. Such a facet would make the results more transparent, since some users are confused about where their search term occurs in the data and why a specific item turns up in the result list. Those interested enough to find out would have a tool at hand.
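The core of such a facet is simple: for each record in the result set, note which typed field(s) actually contain the query term. A sketch, with field names chosen for illustration rather than taken from any particular discovery system:

```python
def found_in(record, term):
    """Return the names of the record's fields containing the search term.

    `record` maps typed field names (title, subject, toc, index, ...)
    to their indexed text; matching is a simple case-insensitive test.
    """
    term = term.lower()
    return [field for field, text in record.items()
            if term in text.lower()]

# A toy record with typed supplementary material:
record = {
    "title":   "Exploring Libraries",
    "subject": "Information retrieval",
    "toc":     "Chapter 3: Faceted browsing",
    "index":   "Ranganathan, S. R., 112",
}
```

Here a search for “faceted” would be labeled “Search found in: TOC”, immediately explaining why the item appeared although its title and subject headings never mention the term.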

The value of the index

Library metadata will increasingly include (some or all of) the content of the cataloged resource to complement the descriptive data. It is currently hard for libraries to capture what is in a book (granted, we have subject headings, but they categorize resources in controlled rather than natural language), and as a consequence some books are easily overlooked.

A friend of mine is doing research on two little-known female Austrian composers of the 1930s for her Master’s thesis. In one instance, she only found out that a book contained important biographical information about them because the book’s index on the publisher’s website listed their names (which she stumbled across via … Google). Obviously neither a table of contents nor a library record could have supplied this pointer. So, failing a full-text scan, digital versions of those parts of a book that serve as windows into its content are helpful additions to the metadata. This is not restricted to TOCs (which can already be found in a number of records) but includes indexes as well. Incidentally, it is not without reason that the Topic Maps technology was originally developed for automated electronic indexes – they are gateways into a book’s contents, reflecting the main concepts it deals with. All of the above assumes, of course, that an index exists – as Baron Campbell says in Lives of the Chief Justices:

“So essential do I consider an Index to be to every book, that I proposed to bring a Bill into Parliament to deprive an author who publishes a book without an index of the privilege of copyright; and, moreover, to subject him, for his offence, to a pecuniary penalty.”

(quoted from: Thomas B. Passin: Explorer’s Guide to the Semantic Web)

If publishers or other providers of bibliographic metadata are ready to make additional material about their publications available for display in library catalogs (so that efforts aren’t duplicated), and if reviews, abstracts, indexes, and TOCs are assembled in one central place, indexed, and thus made searchable, we stand a much better chance of helping users discover the knowledge structure contained in books. Moreover, by not requiring them to go to the shelf to look inside the book, or even request it from closed stacks, we come a step closer to fulfilling Ranganathan’s fourth law: Save the time of the reader.