The conference “Academic Librarian 3: The Yin-Yang of Future Consortial Collaboration and Competition” was held in Hong Kong at the end of last month. The presentations are now available, and I would like to draw your attention to one about cataloging: “From union catalogue to fusion catalogue: how collaborative cataloguing might be initiated and implemented in the Hong Kong context” (PDF). With electronic resources and the accompanying vendor records, the union catalog, with its relatively uniform application of rules and standards, is transformed into a “fusion catalog” with differing cataloging rules and varying levels of detail.

This observation definitely resonates with what I’m dealing with at work right now: the integration of thousands of e-book records for an evidence-based selection model set up by one of the big university libraries we serve. The data comes from OCLC in MARC (and was created with a different set of cataloging rules); it is then converted into the German/Austrian format MAB and into the Aleph Sequential Format so it can be loaded into our catalog. The results are not the “prettiest” records, but this is an efficient way of quickly offering users a large amount of content. Another project that has brought the Austrian union catalog closer to a “fusion catalog” is the big digitization undertaking by the Austrian National Library, “Austrian Books Online”, in which not only books are scanned but also catalog cards, which are then OCRed, automatically transformed into bibliographic records and batch-loaded into the catalog database.
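To make the last conversion step concrete, here is a minimal sketch of writing bibliographic fields out as Aleph Sequential lines. It is not our actual pipeline: it skips the MARC-to-MAB step entirely, and the simplified field representation and sample record are invented for illustration.

```python
# Sketch: serialize a simplified bibliographic record into Aleph Sequential
# format (one line per field: 9-digit doc number, tag plus indicators,
# " L ", then the content with $$-prefixed subfields). Purely illustrative;
# the real workflow converts OCLC MARC to MAB first.

def to_aleph_seq(doc_number, fields):
    """fields: list of (tag, ind1, ind2, [(subfield_code, value), ...])"""
    lines = []
    for tag, ind1, ind2, subfields in fields:
        content = "".join(f"$${code}{value}" for code, value in subfields)
        lines.append(f"{doc_number:09d} {tag}{ind1}{ind2} L {content}")
    return "\n".join(lines)

# Invented sample record for an e-book
sample = [
    ("245", "1", "0", [("a", "An example e-book :"), ("b", "a subtitle /"),
                       ("c", "Jane Doe.")]),
    ("856", "4", "0", [("u", "http://example.org/ebook"),
                       ("z", "Online access")]),
]

print(to_aleph_seq(1, sample))
```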
So does this new “fusion catalog”, with its blended mix of standards, formats, rules and detail, affect the user at all? Or is it all hidden under the discovery layer anyway? Do we still really need, and can we maintain, the high level of consistency of the union catalog? The conference presentation lists some of the lessons learned during the transition from union to fusion catalog, a transition that is sometimes imperceptible to everyone but catalogers:
- Following uniform cataloguing practices
- Preferring a high level of consistency in bibliographic records
- Bringing in vendor records applying different cataloguing rules and various levels of completeness
- Accepting that ‘a minimal record [is more] beneficial to library users than no record at all’

Variations are inevitable:
- The ideal: conform fully to one single cataloguing standard and to local conventions
- In reality: different cataloguing data sets are blended together
- Direct and immediate access to the needed library materials is more important to users than standard cataloguing records
- Variations are accepted, and catalogers are open to accepting differences in cataloguing practices
With RDA on the horizon, we face the prospect of legacy data and new data sitting side by side, as well as data created following different RDA policy decisions on alternatives and options, and differing cataloger’s judgment. If consortial and/or global shared cataloging is to continue, we will finally have to say goodbye to our rather closed world-view and come to terms with a non-uniform, blended mixture of bibliographic information.
Art historians and information systems specialists have been working for two years to make German art sales catalogs (in total about 236,000 art-sale records from more than 1,600 German auction catalogs dating from 1930 to 1945) available online in the Getty Provenance Index. The extensive digitization project was carried out in cooperation with libraries in Berlin and Heidelberg. Read this blog post to learn more about the details of the steps involved: scanning and performing OCR, parsing the data via shell scripts and Perl, hand-editing the data, developing the database and publishing the data as part of the Getty Provenance Index.
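As a rough Python analogue of the parsing step (the project itself used shell scripts and Perl), here is a sketch that pulls structured fields out of an OCRed catalog line. The line layout, field names and sample lot are invented and certainly do not match the Provenance Index’s actual data.

```python
import re

# Hypothetical layout of an OCRed auction-catalog line:
#   "<lot no>. <artist>, <title>, <medium> - RM <price>"
# This regex only illustrates the parsing step; the Getty pipeline's
# real formats and rules were far more involved.
LOT_PATTERN = re.compile(
    r"^(?P<lot>\d+)\.\s+(?P<artist>[^,]+),\s+(?P<title>[^,]+),\s+"
    r"(?P<medium>[^-]+?)\s*-\s*RM\s*(?P<price>[\d.,]+)$"
)

def parse_lot(line):
    m = LOT_PATTERN.match(line.strip())
    return m.groupdict() if m else None  # misses go to hand-editing

print(parse_lot("123. Rembrandt, Bildnis eines Mannes, Öl auf Leinwand - RM 4.500"))
```

A parse failure returning None stands in for the hand-editing step the blog post describes: automation covers the regular lines, humans cover the rest.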
At this year’s German library conference there were two presentations about automatic metadata generation. The ZBW (Deutsche Zentralbibliothek für Wirtschaftswissenschaften, German National Library of Economics) catalogs electronic as well as print articles from books and journals. Metadata for these articles are generated automatically from scanned tables of contents, but they need to be enriched for various reasons: in order to provide reliable links to electronic versions of articles, identifiers (URLs) and metadata have to be correct, and in order to make the data ready for linked-data applications or bibliometric rankings, authority control of authors, topics and other entities is key. So automatic metadata generation is a great help in achieving quantity, but quality (human intervention by linking to authority-controlled data) is necessary to make the data usable and future-proof (description and slides in German here).
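As a toy illustration of that enrichment step, here is a sketch that links raw author strings to authority identifiers. The mini authority file and the GND-style IDs are invented; a production workflow would query a real authority file and route misses to human review.

```python
# Toy enrichment step: link raw author strings from scanned tables of
# contents to authority records. The entries and IDs below are invented.
AUTHORITY = {
    "doe, jane": "gnd:000000001",        # hypothetical identifier
    "mustermann, max": "gnd:000000002",  # hypothetical identifier
}

def normalize(name):
    # crude normalization: "Jane Doe" -> "doe, jane"
    parts = name.strip().rstrip(".").split()
    return f"{parts[-1]}, {' '.join(parts[:-1])}".lower()

def link_author(raw_name):
    return AUTHORITY.get(normalize(raw_name))  # None -> manual review queue

print(link_author("Jane Doe"))        # gnd:000000001
print(link_author("Unknown Person"))  # None: needs human intervention
```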
The German National Library reported on their project of automatically extracting metadata from the title pages of doctoral dissertations. Since these pages conform to a certain pattern, with the same information found in the same place on each thesis’s title page, software that combines OCR with rule-based structure recognition and thesauri can be used. Here’s a summary of the project in English, and the conference slides in German can be found here.
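To illustrate the rule-based idea, here is a small sketch that extracts author, place and year from an OCRed title page. The layout rules and the sample page are invented and are not the DNB’s actual extraction rules.

```python
import re

# Sketch of rule-based extraction from an OCRed dissertation title page.
# Assumes an invented but typical layout: title block first, author after
# "vorgelegt von", and place/year on the last line.
def extract_thesis_metadata(ocr_text):
    lines = [l.strip() for l in ocr_text.splitlines() if l.strip()]
    meta = {"title": lines[0] if lines else None}
    for line in lines:
        if m := re.search(r"vorgelegt von\s+(.+)", line):
            meta["author"] = m.group(1)
        if m := re.search(r"(.+),\s*(\d{4})$", line):
            meta["place"], meta["year"] = m.group(1), m.group(2)
    return meta

page = """Untersuchungen zur Metadatenextraktion
Dissertation zur Erlangung des Doktorgrades
vorgelegt von Max Mustermann
Frankfurt am Main, 2013"""
print(extract_thesis_metadata(page))
```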
It’s always interesting to follow the progress and practical examples of automated metadata generation because descriptive cataloging can be supported and accelerated, and human skills can be used for quality management and error assessment instead of manually entering information that can be captured automatically.
Talking about the consequences of self-publishing (by individuals and, increasingly, by entities like Provincetown Public Library) for the traditional publishing industry, Mike Shatzkin says: “Publishing will become a function of many entities, not a capability reserved to a few insiders who can call themselves an industry.” I wonder if this doesn’t apply to cataloging as well. Libraries used to have a monopoly on cataloging, but they are increasingly losing this status and find themselves relying on third-party records. Cataloging and metadata have become ubiquitous and are no longer reserved for those with the arcane knowledge (on LibraryThing anyone can catalog with a simple web interface), but the library world still has a tendency to think we own and can prescribe the “perfect” bibliographic description (which, after all, is part of our identity and how we define ourselves as an “industry”). Another quote from Shatzkin’s article with parallels to cataloging and the library field: “This is the atomization of publishing, the dispersal of publishing decisions and the origination of published material from far and wide. In a pretty short time, we will see an industry with a completely different profile than it has had for the past couple of hundred years.”
Relationships are key – this is an example of how we did it in the analog age: noting down a review of the book on the back of its catalog card.
Library practices of bibliographic description have so far taken for granted the stability of the book. In the future, we might have to deal with describing versioning, forking and remixing. The article “Forking the book” argues that dynamic content will become possible. As an example, it highlights a tool that lets you edit EPUB with GIT as a backend. “[W]ith this demo we are using GIT with a book so you can clone, edit, fork and merge the book into infinite versions.” There is already a platform for remixing books, BookRiff, which has not yet gained wide acceptance but which is slated to enable the kind of forking the article talks about.
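To see how little machinery the forking idea needs, here is a self-contained sketch that drives plain Git commands from Python: a book repository, a fork branch with a remixed chapter, and a merge back. The file names are invented, and this is of course not the EPUB demo tool the article describes.

```python
import os
import subprocess
import tempfile

# Sketch of the "book as Git repository" idea: one chapter file, a fork
# branch with an edit, then a merge back into the original line of the book.
def git(*args, cwd):
    subprocess.run(["git", *args], cwd=cwd, check=True)

repo = tempfile.mkdtemp()
git("init", "-b", "main", cwd=repo)  # -b needs git >= 2.28
git("config", "user.name", "Demo", cwd=repo)
git("config", "user.email", "demo@example.org", cwd=repo)

with open(os.path.join(repo, "chapter1.md"), "w") as f:
    f.write("Original opening chapter.\n")
git("add", "chapter1.md", cwd=repo)
git("commit", "-m", "initial book", cwd=repo)

git("checkout", "-b", "fork", cwd=repo)  # a reader forks the book
with open(os.path.join(repo, "chapter2.md"), "w") as f:
    f.write("A remixed chapter added in the fork.\n")
git("add", "chapter2.md", cwd=repo)
git("commit", "-m", "fork: add remixed chapter", cwd=repo)

git("checkout", "main", cwd=repo)
git("merge", "fork", cwd=repo)  # merge the fork back into the book
```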
Data modeling has to be aware of developments in the creation of the objects it primarily describes and makes discoverable. Borrowing expressions from the print paradigm, the forked book is comparable to a kind of “bound with”, multi-work constellation, but more complicated, since only parts of works might be used, different versions might be created, and licensing information would have to be noted. I guess BIBFRAME will be able to accommodate these versions and remixes, but that would mean that the statement in the November BIBFRAME report, “Each BIBFRAME Instance is an instance of one and only one BIBFRAME Work”, will not hold, because, as I see it, the instance (the remixed/forked book) would be in a relationship with two or more works.
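Put as a data-modeling sketch (using invented class names, not actual BIBFRAME vocabulary): if a remixed instance can realize several works, the instance-to-work link becomes a list rather than a single pointer.

```python
from dataclasses import dataclass, field

# Sketch of the modeling question, not real BIBFRAME vocabulary: a remix
# forces "instance of" to be one-to-many instead of one-to-one.
@dataclass
class Work:
    title: str

@dataclass
class Instance:
    label: str
    instance_of: list[Work] = field(default_factory=list)  # 1..n, not 1..1

remix = Instance(
    label="Remixed e-book, v2 (fork)",
    instance_of=[Work("Original textbook"), Work("Companion exercises")],
)
print([w.title for w in remix.instance_of])
```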
The current issue of the Code4Lib journal contains an article by Jeremy Nelson of Colorado College, “Building a Library App Portfolio with Redis and Django”, that highlights the development of FRBR datastores that run on a NoSQL database server (Redis). More interestingly, perhaps, this platform is based on the BIBFRAME model with four core classes: creative work, instance, annotation and authority (for further details, see the project site on GitHub). To me, such a two-level mapping makes a lot of sense. In fact, I quite like the reduction of the FRBR complexity in the BIBFRAME model, especially with the anticipated re-use by other communities in mind. Jeremy Nelson explains on the BIBFRAME mailing list:
Because of Redis’s flexibility, I’ve been able to use RDA element names as either discrete properties for each BIBFRAME entity or as part of the naming scheme for the BIBFRAME entity’s associated keys. A nice feature of this approach is that we are not restricted to just RDA but we can use other metadata standards (MODS, DC, ONIX, VRACore, etc.) as discrete properties or as part of the Redis key naming schema for the BIBFRAME entities. We are also using a simplistic mapping of FRBR Work and Expressions to BIBFRAME Creative Work, along with FRBR Manifestation and Item to the BIBFRAME Instance…
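Here is a sketch of what that key-naming idea might look like with redis-py against a local Redis server. The key patterns and RDA element names are my guesses for illustration, not the project’s actual schema (see the GitHub repository for that).

```python
import redis

# Illustrative key naming only: RDA elements as hash properties of a
# BIBFRAME entity, and as part of the names of associated keys.
r = redis.StrictRedis(host="localhost", port=6379, db=0,
                      decode_responses=True)

# mint a new Creative Work entity
work_id = r.incr("global:bibframe:CreativeWork")
work_key = f"bibframe:CreativeWork:{work_id}"

# RDA element names as discrete hash properties of the entity...
r.hset(work_key, "rda:titleProper", "Building a Library App Portfolio")
r.hset(work_key, "rda:dateOfPublication", "2013")

# ...or as part of the naming scheme for the entity's associated keys
r.sadd(f"{work_key}:rda:creator", "Nelson, Jeremy")

print(r.hgetall(work_key))
```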