
The Getty Search Gateway

The J. Paul Getty Trust, consisting of the J. Paul Getty Museum, the Getty Research Institute, the Getty Conservation Institute and the Getty Foundation, has recently launched an exciting new portal, the Getty Search Gateway (see also the press release). It lets you simultaneously search and browse the collection database, the library catalog, collection inventories and archival finding aids, as well as digital collections, and filter the results using facets. It especially caught my attention because of its similarity to library discovery layers: it provides a convenient way to search across collections and across a variety of resource formats. Mike Clardy, Assistant Director, Information Systems / Information Technology Services at the Getty, who wrote a blog post to introduce the new research tool, and Joe Shubitowski, Head, Library Information Systems, were kind enough to answer my curious questions and to share some details about the development and the underlying structure, which I’ll paraphrase here.

As you may have guessed, the search gateway was built using the Solr / Lucene search engine. The objective was to bring together a number of sources and formats under one umbrella, so the schema definition had to be flexible enough to support the wide variety of contributing sources. Generally, fields in Solr are strongly typed: every field in the schema is defined to be of a certain type, with specifications about its intended use. But as I learned while reading up on the Solr schema, Solr also offers ways to create fields dynamically, without their being pre-defined or explicitly named. With <dynamicField> declarations, you can write rules that tell the application what to do with any field whose name matches a given pattern – what data type to use, whether to index or merely store it, and so on.
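To make this concrete, here is a minimal, hypothetical schema.xml fragment of my own devising – not the Getty’s actual schema – showing what such rules might look like. Any field whose name ends in _t is treated as searchable text, anything ending in _facet as an exact string suitable for faceting, and anything ending in _display is stored for display only:

<types>
  <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
  <fieldType name="text" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
</types>

<fields>
  <!-- the one explicitly defined, required field -->
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <!-- rule: any field ending in _t is tokenized, searchable text -->
  <dynamicField name="*_t" type="text" indexed="true" stored="true"/>
  <!-- rule: any field ending in _facet is an exact string for faceting -->
  <dynamicField name="*_facet" type="string" indexed="true" stored="true"/>
  <!-- rule: any field ending in _display is stored for display only -->
  <dynamicField name="*_display" type="string" indexed="false" stored="true"/>
</fields>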

In the case of the Getty Search Gateway, this makes it possible for every contributing source to decide which fields to include in the index, which fields to display (and in what order), and how to label them. More specifically, the Solr schema developed by the Getty staff contains very few required fields, very few mapped fields that all data sources have to populate, and dynamic fields that any source can use to index and display its holdings. A single incoming field may get copied into several different Solr fields, each with different options – for searching, sorting, faceting or display, for example. This approach to aggregating museum and library data provides some major facets to pivot on, but also gives each data contributor the freedom to export, index and display the data elements it deems most important. Custom XSL transformations were written for every data source to map its records into the shared schema.
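Continuing the hypothetical sketch from above (again, these field names are my invention, not the Getty’s), Solr’s <copyField> directive is what lets one incoming value feed several fields at once, while each source remains free to coin its own field names within the dynamic patterns:

<!-- an additional rule: fields ending in _sort are untokenized strings for sorting -->
<dynamicField name="*_sort" type="string" indexed="true" stored="false"/>

<!-- one incoming title, three uses: full-text searching, sorting, faceting -->
<copyField source="title_t" dest="title_sort"/>
<copyField source="title_t" dest="title_facet"/>

A second source could send, say, imprint_t or artist_facet without any change to the schema – the dynamic rules take care of the typing and options.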

The possibility for each source to specify its own options is very powerful and has great potential for other applications. The Solr schema is cleverly exploited in the design of this implementation. I wasn’t previously aware of these possibilities in Solr and really appreciate the chance to understand its inner workings a bit better.

Consistency and identity management

Consistency is a strange thing. We are in dire need of it to give computers something reliable to work with, yet for various reasons we are unlikely to achieve the necessary level of consistency in our data. First, we are human, and inconsistency can be said to be part of human nature; second, different catalogers enter data into the same pool, and they don’t all do things exactly the same way. We can (and, as catalogers, should) strive for as much consistency as possible in our own work, but factors such as these get in the way.

Current ILSs match strings when indexing, so it’s hard for them to tell whether “Oxford UP” and “Oxford Univ. Pr.” and “Oxford University Press” (I’ll spare you the other ways to write this – which exist!) are the same publisher or not. Users wanting to browse the titles of a certain publisher are left to click through lists of variant names (typos and such included…). Or even worse, in the absence of such an index, they have to search for all the variations themselves.

Why not cluster or merge these variants under one term? The technical possibilities are there – the freely available Google Refine, or topic maps, for that matter – and I’m sure they could be implemented in library systems. A simple list of values to choose from while cataloging would be another, if more limited, option. Here software can help straighten out human errors and inconsistencies (which, let’s face it, will continue to exist), and users would benefit from a display that is more useful and saves them time. Identity management, anyone?
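As a small proof of concept, here is a Python sketch of the key-collision (“fingerprint”) clustering that Google Refine popularized: normalize each name into a key and group the names whose keys collide. The abbreviation table is my own simplified assumption, not a standard list:

import re
from collections import defaultdict

# crude abbreviation expansions -- a simplified assumption, not a standard list
EXPANSIONS = {"univ": "university", "pr": "press", "up": "university press"}

def fingerprint(name):
    """Reduce a name to a clustering key: lowercase, drop punctuation,
    expand abbreviations, then sort the unique tokens."""
    tokens = []
    for tok in re.sub(r"[^\w\s]", " ", name.lower()).split():
        tokens.extend(EXPANSIONS.get(tok, tok).split())
    return " ".join(sorted(set(tokens)))

def cluster(names):
    """Group names whose fingerprints collide."""
    groups = defaultdict(list)
    for name in names:
        groups[fingerprint(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]

publishers = ["Oxford University Press", "Oxford Univ. Pr.", "Oxford UP"]
print(cluster(publishers))
# -> [['Oxford University Press', 'Oxford Univ. Pr.', 'Oxford UP']]

A cataloging interface could then present such a cluster and let a human decide whether to merge the variants under one authorized form.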

Moving Forward – blog

In one of his recent postings to NGC4Lib, Jonathan Rochkind mentioned the University of Wisconsin resource discovery blog Moving Forward. As a cataloging geek, I had to go and check it out ;). If you are keen to learn about the inner workings of a discovery system based on Solr and Blacklight (without too much technical detail, unless you want it), about indexing and searching and the interaction between back-end and front-end, this blog is for you. In particular, I enjoy the clear and accessible language of the posts.

Just as an example, let me warmly recommend “Bibliographic Description? Bibliographic Interaction!”. Enabling users to combine terms across subject headings empowers them to pursue their own semantic interpretations of subjects – they don’t necessarily need to match the subject strings the cataloger came up with. To be honest, these possibilities of subject browsing really impress me, as I have never seen such an implementation before. It goes to show that with cleverness and the available technology, some of the rigidity of MARC can be overcome and data can begin to “dance” – not clumsily but elegantly.
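For the curious, here is how I imagine the general idea, in a toy Python sketch – my own reconstruction, not the Wisconsin code: every subdivision of a heading becomes a facet value of its own, so terms can be intersected even when no single heading string contains them all:

# toy records with MARC-style subject headings (subdivisions joined by "--")
records = {
    "rec1": ["France -- History -- Revolution, 1789-1799",
             "Women -- Social conditions"],
    "rec2": ["Women -- History"],
}

# index every subdivision as an independent facet term
facet_index = {}
for rec_id, headings in records.items():
    for heading in headings:
        for term in (part.strip().lower() for part in heading.split("--")):
            facet_index.setdefault(term, set()).add(rec_id)

def combine(*terms):
    """Intersect facet terms, no matter which heading each came from."""
    return set.intersection(*(facet_index.get(t.lower(), set()) for t in terms))

print(combine("Women", "France"))
# -> {'rec1'}: no single heading pairs the two terms, yet the record matches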