Art historians and information systems specialists have been working for two years to make German art sales catalogs (in total about 236,000 art-sale records from more than 1,600 German auction catalogs dating from 1930 to 1945) available online in the Getty Provenance Index. The extensive digitization project was carried out in cooperation with libraries in Berlin and Heidelberg. Read this blog post for details of the steps involved: scanning and OCR, parsing the data via shell scripts and Perl, hand-editing the data, developing the database, and publishing the data as part of the Getty Provenance Index.
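To give a rough idea of what the parsing step might involve (the field layout and pattern below are invented for illustration; the project's actual Perl scripts will certainly look different), here is a small Python sketch that turns an OCR'd catalog line into a structured record:

```python
import re

# Hypothetical line layout (lot number, artist, title, price in Reichsmark).
# The real catalogs and the project's scripts certainly differ from this.
LINE = re.compile(
    r"^(?P<lot>\d+)\s+(?P<artist>[^,]+),\s*(?P<title>.+?)\s+(?P<price>[\d.]+)\s*RM$"
)

def parse_line(line: str):
    """Turn one OCR'd catalog line into a dict of fields, or None if it doesn't match."""
    m = LINE.match(line.strip())
    return m.groupdict() if m else None

print(parse_line("123 Liebermann, Gartenlokal an der Havel 4500 RM"))
# -> {'lot': '123', 'artist': 'Liebermann', 'title': 'Gartenlokal an der Havel', 'price': '4500'}
```

Lines that fail to match would presumably be flagged for the hand-editing step mentioned in the post.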
The HathiTrust Research Center (HTRC) “is a collaborative research center launched jointly by Indiana University and the University of Illinois, along with the HathiTrust Digital Library, to help meet the technical challenges of dealing with massive amounts of digital text that researchers face by developing cutting-edge software tools and cyberinfrastructure to enable advanced computational access to the growing digital record of human knowledge.”
Here’s a video that details HTRC’s mission of supporting scholars (e.g. in the digital humanities) in their research:
I would be particularly interested in learning more about a project mentioned about 2:45 into the video that involves “automatically enhancing the metadata that describes the volumes”, ultimately resulting in higher-quality metadata – maybe we’ll hear more about it in the future.
via CDLINFO News
The article “Cataloging Then, Now, and Tomorrow” cites three trends in cataloging: “the increasing reliance on vendor-supplied records and services, the explosion of electronic resources, and the growing interrelatedness of local library catalogs with systems outside the library.”
Well, I’m excited about getting to address the first two at work. I was able to slightly shift the focus of my role and am now one of two people responsible for managing the automated cataloging of vendor/publisher-supplied ebook data. After retrieving the data packages via FTP, we run shell scripts to modify them according to our needs, to load them into the ILS and to create holdings. There are some plans to support our consortium members with patron-driven acquisitions, and I’ll be involved in that project, too.
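For the curious, here is a minimal sketch of that retrieval step in Python rather than shell (the host, credentials, paths and the name of the preparation script are made up; our actual vendor setup differs):

```python
from ftplib import FTP
import subprocess

# Hypothetical host, credentials and paths; the real vendor setup differs.
HOST, REMOTE_DIR = "ftp.example-vendor.com", "marc/new"

with FTP(HOST) as ftp:
    ftp.login("user", "password")
    ftp.cwd(REMOTE_DIR)
    # Pick up every MARC file the vendor has staged for us.
    for name in ftp.nlst():
        if name.endswith(".mrc"):
            with open(name, "wb") as out:
                ftp.retrbinary(f"RETR {name}", out.write)
            # Hand the file to a (hypothetical) local shell script that rewrites
            # URLs, adds local fields and prepares the ILS load and holdings.
            subprocess.run(["./prepare_ebook_load.sh", name], check=True)
```

The real work, of course, happens in the record-editing scripts, which depend entirely on local cataloging policies and the ILS load profiles.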
So I now have a foot in both worlds – the traditional cataloging world of one item (or one card) at a time, and the world of using “power tools” to manipulate large quantities of metadata without having to touch each record.
Imagine a user wants to read a public-domain book in electronic form. She’d be faced with the same situation as users before the advent of unified resource discovery systems – she has to go to various places on the web and do separate searches. Wouldn’t it be nice if there were a meta catalog for digitized works that brings together data from the likes of the Internet Archive, HathiTrust, Project Gutenberg, Europeana or Google Books? It could show which books were digitized by whom, whether they are downloadable, in what format, on what devices they can be read, etc. Such a directory could also let users compare quality when the same work is available in different versions. Another benefit would be a reduction in duplicated effort. Having duplicate electronic versions is not necessarily bad, but are time and money not better spent on unique materials not digitized elsewhere? Local priorities could be determined on a more informed basis.
All of this occurred to me while reading an article about the eBooks-on-Demand (EOD) service discovery platform (from p. 229 here, in German). EOD is a joint initiative of over 30 libraries from 12 European countries that each run their own digitization activities. Together they offer a (paid) service that lets users order a public-domain book to be digitized and delivered as an ebook. Instead of relying on users to discover EOD books “by chance” in the respective libraries’ catalogs, a VuFind search interface was built that allows finding books for digitization from all participating libraries in one central place and gives direct access to already digitized items. Records are ingested via OAI or FTP batch upload. For the future the project team plans to enhance the search platform to include links (via API queries of players like those I mentioned above) to works already digitized elsewhere. And this is where the idea of a central overarching catalog for digitized public-domain works popped up. Existing portals such as the Zentrales Verzeichnis digitalisierter Drucke (ZVDD, central catalog of digitized printed works, which covers digital versions created in Germany) point in the right direction, but we definitely have to think more globally and on a larger scale.
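As a small illustration of what OAI-based ingest looks like in practice (the endpoint URL below is a made-up placeholder, and the real EOD ingest, including the FTP batch route, is of course more involved), harvesting Dublin Core records boils down to something like this:

```python
import requests
import xml.etree.ElementTree as ET

# Hypothetical OAI-PMH endpoint of one participating library.
ENDPOINT = "https://library.example.org/oai"
OAI = "{http://www.openarchives.org/OAI/2.0/}"
NS = {"dc": "http://purl.org/dc/elements/1.1/"}

resp = requests.get(ENDPOINT, params={"verb": "ListRecords", "metadataPrefix": "oai_dc"})
root = ET.fromstring(resp.content)
for record in root.iter(OAI + "record"):
    title = record.find(".//dc:title", NS)
    ident = record.find(".//dc:identifier", NS)
    print(title.text if title is not None else "(no title)",
          "->", ident.text if ident is not None else "(no identifier)")
```

A production harvester would also handle resumption tokens, deleted records and incremental (from/until) harvesting, but the protocol itself stays this simple.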
HathiTrust has launched a new blog, Perspectives from HathiTrust. The first post is by John Wilkin, Executive Director. He describes the strategy HathiTrust follows to make its content discoverable. Integration into the broader bibliographic access landscape, i.e. making the digitized material findable in a number of environments, is a central mission. Major points:
- The temporary catalog based on VuFind will be retired and replaced by a new OCLC WorldCat Local catalog (press release here). Inclusion of HathiTrust content in a database where many libraries manage their collections emphasizes the role it can play in collection management and analysis, not only for partner libraries but for the broader community. Of course, APIs and record distribution via OAI are also important for access to HathiTrust content (see the sketch after this list).
- Much of this content is already in Google Books and Internet Archive, but HathiTrust wants to open more avenues for discoverability by incorporating its full text indexes into the Summon discovery service (see press release). Availability in similar tools is likely to follow.
- The standalone HathiTrust full-text search service will be enhanced with new features such as faceting or weighting of results.
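As an aside on the API point above: HathiTrust offers a Bibliographic API for looking up volumes by standard identifiers. Here is a minimal sketch of a lookup by OCLC number; the exact endpoint form and response fields are quoted from memory, so check the current API documentation before relying on them:

```python
import requests

# Lookup by OCLC number; the number below is just an example, and the endpoint
# form and field names should be verified against the current documentation.
oclc = "424023"
url = f"https://catalog.hathitrust.org/api/volumes/brief/oclc/{oclc}.json"
data = requests.get(url).json()
for item in data.get("items", []):
    print(item.get("htid"), item.get("rightsCode"), item.get("itemURL"))
```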
Metadata, not only for digital libraries but also for regular library catalogs, is increasingly going to be collected from heterogeneous sources in various formats. With a plethora of metadata standards around, what could a framework “that likes any metadata it sees” (Roy Tennant) look like?
If you’d like to learn more about how metadata travels from harvesting from different sources to aggregated presentation, one good place to start would be “Strategies for reprocessing aggregated metadata” by Muriel Foulonneau and Timothy W. Cole (PDF, published in Research and Advanced Technology for Digital Libraries (proceedings of the 9th European Conference on Digital Libraries, ECDL 2005), Heidelberg: Springer, ISBN 3-540-28767-1).
The OAI protocol facilitates the aggregation of large numbers of heterogeneous metadata records. To make harvested records usable in the context of an OAI service provider, the records typically must be filtered, analyzed and transformed. The CIC metadata portal harvests 450,000 records from 18 repositories at 9 U.S. Midwestern universities. The process implemented for transforming metadata records for this project supports multiple workflows and end-user interfaces. The design of the metadata transformation process required trade-offs between homogeneity of the aggregation and utility for purpose on the one hand, and pragmatic constraints such as feasibility, human resources, and processing time on the other.
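To make the “filtered, analyzed and transformed” part a bit more concrete, here is a toy sketch of the kind of normalization an aggregator might apply to harvested Dublin Core records (the field names and rules are invented purely for illustration):

```python
import re

# Invented mapping from free-text resource types to a small controlled list.
TYPE_MAP = {"text": "Text", "txt": "Text", "image": "Image", "still image": "Image"}

def reprocess(record: dict) -> dict:
    """Normalize one harvested record (a dict mapping DC elements to value lists)."""
    out = dict(record)
    # Collapse the many date spellings ("c. 1890", "1890-05-01", "1890?") to plain years.
    years = [m.group(0) for d in record.get("date", [])
             if (m := re.search(r"\b1[5-9]\d\d\b|\b20\d\d\b", d))]
    if years:
        out["date"] = sorted(set(years))
    # Map free-text resource types onto the controlled list.
    out["type"] = sorted({TYPE_MAP.get(t.strip().lower(), "Other") for t in record.get("type", [])})
    return out

print(reprocess({"title": ["Prairie scenes"], "date": ["c. 1890", "1890?"], "type": ["still image", "Image"]}))
# -> {'title': ['Prairie scenes'], 'date': ['1890'], 'type': ['Image']}
```

Real aggregators apply dozens of such rules, often tuned per repository, which is exactly why the paper frames the design as a set of trade-offs.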
As it looks at transitioning away from MARC, I hope the Library of Congress will also consider an infrastructure designed to facilitate linking, ingestion and reprocessing of a variety of metadata from other information providers. Maybe it really is time for a meta model?
The German National Library has been working on a project for automatic subject classification that is expected to go live at the end of this year. This project, called PETRUS, is explained in detail in this article (in German). Part of the Library’s mission is the creation of the German National Bibliography, and to keep pace with the onslaught of (digital) resources, more efficiency is needed. The main areas where human intellectual work is supported by machines in the PETRUS project are indexing, classification, metadata extraction and metadata generation.
At the moment, the project covers full-text electronic / online publications in German and English. Subject headings from the SWD (Schlagwortnormdatei) and DDC numbers are automatically assigned to the resources. I assume more will be reported on precision rates once the project becomes operational. To be sure, the more full text is available, the more material there is for the computer to process and learn from, and the more successful automatic classification will be.
Machine learning as well as linguistic and semantic analysis are at the heart of the PETRUS project. The Library collaborates with Averbis, a company that has already built several information retrieval platforms on the basis of Lucene, e.g. Medpilot for the German National Library of Medicine. Their technology uses various NLP methods to assign classification numbers and headings from subject authority files.
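I don’t know the details of Averbis’s technology, but as a much-simplified illustration of the general idea behind statistical subject classification, here is a toy sketch that assigns top-level DDC classes to short German full-text snippets (the training data is a stand-in; a production system would learn from large full-text corpora and a far finer-grained class structure):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: full-text snippets labelled with top-level DDC classes.
texts = [
    "Untersuchung der Photosynthese in Hochgebirgspflanzen",
    "Zellteilung und genetische Regulation bei Hefen",
    "Die Lyrik des Expressionismus in Deutschland",
    "Romananalyse und Erzähltheorie im 20. Jahrhundert",
    "Bilanzierung und Controlling in mittelständischen Unternehmen",
    "Grundlagen des Marketings für Dienstleistungsbetriebe",
]
ddc = ["570", "570", "830", "830", "650", "650"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, ddc)

# Predicted top-level DDC class for a new text, e.g. '570' (life sciences).
print(model.predict(["Genetische Regulation der Photosynthese in Pflanzen"]))
```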
The German National Library will pursue a two-pronged approach to cataloging digital materials. First, publishers and other suppliers of publications are encouraged to include metadata when submitting resources to the Library as legal deposit; this metadata can be reused, repurposed or iteratively enhanced. Second, the Library will use automatic tools to extract and/or generate metadata from machine-processable data. Human intervention will take place only in specific cases, for quality control or to clarify difficult ones.