Art historians and information systems specialists have been working for two years to make German art sales catalogs (in total about 236,000 art-sale records from more than 1,600 German auction catalogs dating from 1930 to 1945) available online in the Getty Provenance Index. The extensive digitization project was carried out in cooperation with libraries in Berlin and Heidelberg. Read this blog post to learn more about the details of the steps involved: scanning and performing OCR, parsing the data via shell scripts and Perl, hand-editing the data, developing the database and publishing the data as part of the Getty Provenance Index.
At this year’s German library conference there were two presentations about automatic metadata generation. The ZBW (Deutsche Zentralbibliothek für Wirtschaftswissenschaften, German National Library of Economics) catalogs electronic as well as print articles from books and journals. Metadata for these articles is generated automatically from scanned tables of contents, but the records need to be enriched for several reasons: in order to provide reliable links to electronic versions of articles, identifiers (URLs) and metadata have to be correct, and in order to make the data ready for linked data applications or bibliometric rankings, authority control of authors, topics and other entities is key. So automatic metadata generation is a great help in achieving quantity, but quality work (human intervention, such as linking to authority-controlled data) is necessary to make the data usable and future-proof (description and slides in German here).
The German National Library reported on its project of automatically extracting metadata from the title pages of doctoral dissertations. Since these pages follow a common pattern, with the same information appearing in the same place on each title page, software that can decipher structures according to rules, supported by thesauri and OCR, can be used. Here’s a summary of the project in English, and the conference slides in German can be found here.
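The rule-based approach described above can be illustrated with a minimal sketch. The field names and regular expressions below are hypothetical examples, not the German National Library’s actual rules, which would also draw on thesauri and layout analysis:

```python
import re

# Hypothetical rule set: on dissertation title pages the same fields
# tend to appear in predictable positions and phrasings, so simple
# patterns can pull them out of OCR'd text.
RULES = {
    "title":  re.compile(r"^(.+?)\n", re.MULTILINE),   # first line of the page
    "author": re.compile(r"vorgelegt von\s+(.+)"),     # "submitted by ..."
    "year":   re.compile(r"\b(19|20)\d{2}\b"),         # a plausible year
}

def extract_metadata(ocr_text: str) -> dict:
    """Apply each rule to the OCR'd page and keep the first match."""
    record = {}
    for name, pattern in RULES.items():
        match = pattern.search(ocr_text)
        if match:
            record[name] = match.group(0 if name == "year" else 1).strip()
    return record

# A made-up title page standing in for OCR output.
page = """Untersuchungen zur Katalogisierung
Inaugural-Dissertation
vorgelegt von Anna Beispiel
Heidelberg 2012"""

print(extract_metadata(page))
```

The appeal of this pattern is that each rule is independent: adding a new field means adding one entry to the rule table, not rewriting the extractor.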
It’s always interesting to follow the progress and practical examples of automated metadata generation, because it can support and accelerate descriptive cataloging, freeing human skills for quality management and error assessment instead of manual entry of information that can be captured automatically.
Talking about the consequences of self-publishing (by individuals and increasingly by entities like Provincetown Public Library) for the traditional publishing industry, Mike Shatzkin says: “Publishing will become a function of many entities, not a capability reserved to a few insiders who can call themselves an industry.” I wonder if this doesn’t apply to cataloging as well. Libraries used to have a monopoly on cataloging, but they are increasingly losing that status and find themselves relying on third-party records. Cataloging and metadata have become ubiquitous and are no longer reserved for those with arcane knowledge (on LibraryThing anyone can catalog with a simple web interface), but the library world still has a tendency to think we own and can prescribe the “perfect” bibliographic description (which after all is part of our identity and how we define ourselves as an “industry”). Another quote from Shatzkin’s article with parallels to cataloging and the library field: “This is the atomization of publishing, the dispersal of publishing decisions and the origination of published material from far and wide. In a pretty short time, we will see an industry with a completely different profile than it has had for the past couple of hundred years.”
Library practices of bibliographic description have so far taken the stability of the book for granted. In the future, we might have to deal with describing versioning, forking and remixing. The article “Forking the book” argues that dynamic content will become possible. As an example, it highlights a tool that lets you edit EPUB with GIT as a backend: “[W]ith this demo we are using GIT with a book so you can clone, edit, fork and merge the book into infinite versions.” There is already a platform for remixing books, BookRiff, which has not yet gained wide acceptance but which is slated to enable the kind of forking the article talks about.
Data modeling has to be aware of developments in the creation of the objects it primarily describes and makes discoverable. Borrowing expressions from the print paradigm, the forked book is comparable to a kind of “bound with”, multi-work constellation, but more complicated, since only parts of works might be used, different versions might be created, and licensing information would have to be noted. I guess Bibframe will be able to accommodate these versions and remixes, but that would mean that the statement in the November Bibframe report, “Each BIBFRAME Instance is an instance of one and only one BIBFRAME Work”, will not hold: as I see it, the instance (the remixed/forked book) would be in a relationship with two or more works.
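To make the modeling problem concrete, here is a minimal sketch of that relationship. The class and attribute names are hypothetical illustrations, not BIBFRAME vocabulary: a single remixed instance points back to more than one work, which is exactly what the one-Instance-one-Work constraint would rule out:

```python
from dataclasses import dataclass, field

@dataclass
class Work:
    """A distinct intellectual creation (stand-in for a BIBFRAME Work)."""
    title: str

@dataclass
class Instance:
    """A published embodiment; here it may draw on several Works."""
    label: str
    instance_of: list            # two or more Works for a remix/fork
    license_notes: list = field(default_factory=list)

# A remixed book assembled from parts of two source works,
# each part carrying its own licensing information.
remix = Instance(
    label="Remixed open textbook",
    instance_of=[Work("Original textbook"), Work("Supplementary essays")],
    license_notes=["CC BY-SA (chapters 1-3)", "CC BY (appendix)"],
)

print(len(remix.instance_of))  # → 2
```

The point of the sketch is only that `instance_of` is a list: once an instance can reference two works, the cardinality statement quoted above no longer holds.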
The current issue of the Code4Lib journal contains an article by Jeremy Nelson of Colorado College, “Building a Library App Portfolio with Redis and Django”, that highlights the development of FRBR datastores that run on a NoSQL database server (Redis). More interestingly, perhaps, this platform is based on the BIBFRAME model with four core classes: creative work, instance, annotation and authority (for further details, see the project site on GitHub). To me, such a two-level mapping makes a lot of sense. In fact, I quite like the reduction of the FRBR complexity in the BIBFRAME model, especially with the anticipated re-use by other communities in mind. Jeremy Nelson explains on the BIBFRAME mailing list:
Because of Redis’s flexibility, I’ve been able to use RDA element names as either discrete properties for each BIBFRAME entity or as part of the naming scheme for the BIBFRAME entity’s associated keys. A nice feature of this approach is that we are not restricted to just RDA but we can use other metadata standards (MODS, DC, ONIX, VRACore, etc.) as discrete properties or as part of the Redis key naming schema for the BIBFRAME entities. We are also using a simplistic mapping of FRBR Work and Expressions to BIBFRAME Creative Work, along with FRBR Manifestation and Item to the BIBFRAME Instance…
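The key-naming idea Nelson describes can be sketched in a few lines of Python. An in-memory dict stands in for Redis here, and the key pattern `bf:<Entity>:<id>:<scheme>:<element>` is my own hypothetical example, not the project’s actual schema:

```python
# Stand-in for a Redis key-value store, so the sketch runs without a server.
store = {}

def set_property(entity: str, entity_id: int, scheme: str, element: str, value: str) -> str:
    """Store a metadata element under a BIBFRAME-entity key.

    The metadata standard (rda, dc, mods, ...) and the element name are
    folded into the key itself, so entities are not restricted to one schema.
    """
    key = f"bf:{entity}:{entity_id}:{scheme}:{element}"
    store[key] = value
    return key

# An RDA element as part of the key for a Creative Work...
set_property("CreativeWork", 1, "rda", "titleProper", "Forking the Book")
# ...and the same mechanism for a Dublin Core element on an Instance.
set_property("Instance", 7, "dc", "format", "application/epub+zip")

print(store["bf:CreativeWork:1:rda:titleProper"])  # → Forking the Book
```

This mirrors the flexibility in the quote: nothing in the storage layer privileges RDA, since the standard is just one segment of the key-naming scheme.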
The HathiTrust Research Center (HTRC) “is a collaborative research center launched jointly by Indiana University and the University of Illinois, along with the HathiTrust Digital Library, to help meet the technical challenges of dealing with massive amounts of digital text that researchers face by developing cutting-edge software tools and cyberinfrastructure to enable advanced computational access to the growing digital record of human knowledge.”
Here’s a video that details HTRC’s mission of supporting scholars (e.g. in the digital humanities) in their research:
I would be particularly interested in learning more about a project mentioned about 2:45 into the video that involves “automatically enhancing the metadata that describes the volumes”, ultimately resulting in higher quality metadata – maybe we’ll hear more about it in the future.
via CDLINFO News