The German National Library has been working on a project for automatic subject classification that is expected to go live at the end of this year. The project, called PETRUS, is explained in detail in this article (in German). Part of the Library's mission is to create the German National Bibliography, and keeping pace with the onslaught of (digital) resources requires greater efficiency. In the PETRUS project, machines support human intellectual work in four main areas: indexing, classification, metadata extraction and metadata generation.
At the moment, the project covers full-text electronic/online publications in German and English. Subject headings from the SWD (Schlagwortnormdatei, the German subject headings authority file) and DDC (Dewey Decimal Classification) numbers are assigned to the resources automatically. I assume that more will be reported on precision rates once the project is operational. In general, the more full text is available, the more material the computer has to process and learn from, and the more successful automatic classification is likely to be.
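How PETRUS will measure its precision rates has not been reported yet. As a rough illustration only, the usual approach is to compare the machine-assigned headings for a resource with those a human cataloger would assign. The sketch below assumes that approach; the function name and the sample headings are invented for the example.

```python
def precision_recall(assigned, reference):
    """Compare machine-assigned headings with a human reference set.

    precision = share of assigned headings that are correct
    recall    = share of reference headings that were found
    """
    assigned, reference = set(assigned), set(reference)
    correct = assigned & reference
    precision = len(correct) / len(assigned) if assigned else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical headings for one resource (not real PETRUS output).
machine = {"Bibliothek", "Klassifikation", "Metadaten", "Informatik"}
human = {"Bibliothek", "Klassifikation", "Inhaltserschließung"}

p, r = precision_recall(machine, human)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four machine-assigned headings match the human reference, so precision is 0.50, and two of the three reference headings were found, so recall is about 0.67.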
Machine learning, together with linguistic and semantic analysis, is at the heart of the PETRUS project. The Library collaborates with Averbis, a company that has already built several information retrieval platforms on the basis of Lucene, e.g. Medpilot for the German National Library of Medicine. Their technology uses various NLP methods to assign classification numbers and headings from subject authority files.
The German National Library will pursue a two-pronged approach to cataloging digital materials. First, publishers and other suppliers of publications are encouraged to include metadata when they submit resources to the Library as legal deposit. This metadata can be reused, repurposed or iteratively enhanced. Second, the Library will use automatic tools to extract and/or generate metadata from machine-processable data. Human intervention will take place only on specific occasions, for quality control or clarification of difficult cases.
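The two-pronged workflow could be sketched roughly as follows. This is not the Library's implementation: the class names, the `extract` callback, the confidence score and the 0.8 review threshold are all assumptions made for illustration. The idea shown is that publisher metadata is reused first, automatic extraction fills in whatever is missing, and low-confidence cases are flagged for a human.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Tuple


@dataclass
class Resource:
    identifier: str
    full_text: str
    publisher_metadata: Optional[dict] = None  # supplied with the legal deposit, if any


@dataclass
class CatalogRecord:
    identifier: str
    metadata: dict = field(default_factory=dict)
    sources: list = field(default_factory=list)
    needs_review: bool = False


def catalog(
    resource: Resource,
    extract: Callable[[str], Tuple[dict, float]],
    review_threshold: float = 0.8,  # assumed cut-off, not a documented value
) -> CatalogRecord:
    record = CatalogRecord(resource.identifier)

    # Prong 1: reuse whatever the publisher or supplier delivered.
    if resource.publisher_metadata:
        record.metadata.update(resource.publisher_metadata)
        record.sources.append("publisher")

    # Prong 2: generate metadata automatically; only fill fields still missing.
    generated, confidence = extract(resource.full_text)
    added = False
    for key, value in generated.items():
        if key not in record.metadata:
            record.metadata[key] = value
            added = True
    if added:
        record.sources.append("automatic")

    # Humans step in only when the automatic result looks uncertain.
    record.needs_review = confidence < review_threshold
    return record


def toy_extract(text: str) -> Tuple[dict, float]:
    """Stand-in for a real classifier: returns (metadata, confidence)."""
    if "library" in text.lower():
        return {"ddc": "020"}, 0.9  # DDC 020: library and information sciences
    return {}, 0.0


record = catalog(
    Resource("r1", "A text about library catalogues", {"title": "Example"}),
    toy_extract,
)
print(record)
```

In this sketch the publisher's values are never overwritten by generated ones. That is one possible design choice; an actual system might instead compare the two or let catalogers decide.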