Packaging University museum collections information as part of open metadata provision
August 9, 2012
This post, from the project’s technical partners, describes some of the technical challenges faced within Culture Grid (CG) and the new search interfaces developed for the Contextual Wrappers 2 project (CW2).
This project follows on from the Contextual Wrappers project (CW), which was previously funded by JISC. The CW project tackled several problems surrounding the representation of Collection Level Descriptions (CLDs) and the linking of these CLDs to relevant item records. However, it also highlighted several new issues. CW2 aimed to tackle these issues and to develop the ideas that were previously investigated within both CG and CW.
In situations such as CG, where records from over 100 different museums, libraries and archives are combined, it is useful to provide an overall collection hierarchy that allows sets of records from different museums to be grouped together or separated out as appropriate. This allows the aggregating platform (CG) to attach extra management information to individual collections that would not normally be made publicly available. This information can include, for example, whether to expose the data through different API endpoints.
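As a rough illustration of the idea, the sketch below models a collection hierarchy in which each node can carry management information. It is not CG's actual data model: the class name, the `expose_via_api` flag, the collection names and the rule that a child inherits a flag from its ancestors are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CollectionNode:
    """One node in a hypothetical collection hierarchy."""
    name: str
    # Management information held by the aggregator, not shown publicly.
    management: Dict[str, bool] = field(default_factory=dict)
    children: List["CollectionNode"] = field(default_factory=list)
    # Excluded from repr/compare to avoid walking the parent/child cycle.
    parent: Optional["CollectionNode"] = field(
        default=None, repr=False, compare=False
    )

    def add(self, child: "CollectionNode") -> "CollectionNode":
        """Attach a child collection and return it."""
        child.parent = self
        self.children.append(child)
        return child

    def setting(self, key: str, default: bool = False) -> bool:
        """Look up a flag, falling back to ancestors (an assumed rule)."""
        node: Optional["CollectionNode"] = self
        while node is not None:
            if key in node.management:
                return node.management[key]
            node = node.parent
        return default


# Hypothetical hierarchy: one grouping, one museum, one restricted set.
root = CollectionNode("All collections", management={"expose_via_api": True})
museum = root.add(CollectionNode("Example Museum"))
restricted = museum.add(
    CollectionNode("Uncatalogued loans", management={"expose_via_api": False})
)
```

Here `museum.setting("expose_via_api")` falls back to the value set on the root, while the restricted collection overrides it locally.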
CG has traditionally allowed very simple ‘grid’ collection descriptions for these groupings, along with the relevant management data. It then allowed highly detailed CLDs, based on RSLP metadata elements, to be uploaded for individual museums’ collections. CW2 takes the lessons learnt from CW and previous projects and applies them to improving the management of collections within CG. This in turn improves the user experience by allowing users to navigate between associated collections within the user interface, helping them to find records more easily.
Another issue that CW2 aims to tackle is caused by the increasing use of application profiles, as implemented by the CW project. The CW work allows different records within CG to have different application profiles applied to them. This means that different records carry data in different sets of metadata fields. The existing CG interfaces simply output every field for which a record had data. While this copes with the fact that different records use different fields, it doesn’t lead to the best user experience, as fields aren’t necessarily shown in a useful order or with appropriate grouping. CW2 aims to solve this problem by ensuring that the application profiles used within the backend CG code are exposed to the frontend interface systems, allowing the information to be displayed in a more consistent manner.
What’s so hard about that?
In retrospect there’s nothing hard about these developments. However, as always when working with a system that needs to deal with almost 3 million records without breaking existing functionality, the devil is in the details! Much of the functionality that has been added for CW2 could easily be implemented in a new aggregation system that doesn’t already contain that much data, as the data can then be made to fit the system on import. This isn’t possible with a system such as CG, which has had to develop organically throughout its lifetime.
The frontend interfaces for CG are completely separate from the backend processing components. While this has the advantage that backend components can be upgraded without adversely affecting the frontend, and vice versa, it means that application programming interfaces (APIs) must be developed to share information between the components. For CW2 it has been necessary to develop mechanisms to share the application profile specification between the backend and the frontend. The standard mechanism for sharing application profiles is an XML Schema. While this is perfectly sensible for data validation and associated purposes, it is too cumbersome to use when building user interfaces from the structure. Therefore a custom JSON structure had to be developed, which allows the frontend interfaces to structure data as specified by the profiles.
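To make this more concrete, here is a minimal sketch of how a frontend might use such a JSON structure to order and group a record's fields. The JSON layout shown (a list of labelled groups, each listing field names in display order) is purely an assumption for illustration and is not the structure CG actually uses; the field names and the `layout_record` helper are likewise hypothetical.

```python
import json

# Hypothetical profile: groups of fields, listed in display order.
PROFILE_JSON = """
{
  "profile": "example-item",
  "groups": [
    {"label": "Identification", "fields": ["title", "identifier"]},
    {"label": "Description", "fields": ["description", "subject"]}
  ]
}
"""


def layout_record(record, profile):
    """Return [(group_label, [(field, value), ...]), ...] in profile order."""
    layout = []
    covered = set()
    for group in profile["groups"]:
        # Keep only the fields this record actually has data for.
        rows = [(name, record[name]) for name in group["fields"] if name in record]
        covered.update(group["fields"])
        if rows:
            layout.append((group["label"], rows))
    # Fields the profile does not mention are kept, grouped at the end.
    extras = sorted((k, v) for k, v in record.items() if k not in covered)
    if extras:
        layout.append(("Other", extras))
    return layout


profile = json.loads(PROFILE_JSON)
record = {
    "subject": "Ceramics",
    "title": "Blue vase",
    "rights": "CC0",
    "identifier": "X.1",
}
layout = layout_record(record, profile)
```

With this approach the order in which fields happen to be stored in a record no longer matters: the display order comes entirely from the profile, and empty groups are simply omitted.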
Improved links from collections to institutions and between ‘institutions’
CG also holds information about many of the cultural institutions within the UK, since it preserves an institution dataset inherited from a prior project. CG has always modelled the link between institutions and the CLDs describing the collections they hold. However, this information was not exposed in enough detail for users to navigate between different collections of records. CW2 has improved this, firstly by extending the application profile implementation to include the description of institution records, and then by ensuring that enough information is provided to allow navigation between the different collections held by an institution.
The initial data model for institution records within CG was based around the data structures that were inherited. This didn’t include links between institutions. This lack of linking can be problematic in the case of institutions such as the University of Cambridge and the Fitzwilliam Museum (FM) which are linked but distinct institutions in real life. Collections held by FM need to somehow be associated with both FM and potentially University of Cambridge. A similar situation exists with records from services like VADS and COPAC in CG, where there is a need to associate records with both the service and the institution(s) their data represents. Therefore CW2 has extended the CG institution data model to provide links between ‘institutions’ allowing the user to perform this sort of navigation between records.
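The kind of navigation this enables can be sketched as follows. This is only an illustration under stated assumptions: the `InstitutionGraph` class, its method names, the treatment of links as symmetric and the collection name are all hypothetical, and the real CG data model may differ.

```python
from collections import defaultdict


class InstitutionGraph:
    """Hypothetical store of institutions, their links and collections."""

    def __init__(self):
        self._links = defaultdict(set)
        self._collections = defaultdict(list)

    def link(self, a, b):
        """Record a link between two institutions (assumed symmetric)."""
        self._links[a].add(b)
        self._links[b].add(a)

    def add_collection(self, institution, collection):
        """Associate a collection with the institution that holds it."""
        self._collections[institution].append(collection)

    def related_collections(self, institution):
        """Collections held by the institution or by directly linked ones."""
        result = list(self._collections.get(institution, []))
        for other in sorted(self._links.get(institution, ())):
            result.extend(self._collections.get(other, []))
        return result


graph = InstitutionGraph()
graph.link("University of Cambridge", "Fitzwilliam Museum")
graph.add_collection("Fitzwilliam Museum", "Example coins collection")

via_university = graph.related_collections("University of Cambridge")
via_museum = graph.related_collections("Fitzwilliam Museum")
```

A collection held by the museum can then be reached from either institution, which is the behaviour the extended model is meant to support.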
Automatic collection data
CG holds both metadata representations of collections (CLDs) and of the items themselves within the collections. This means that it is possible for CG to perform some automatic analysis of the item metadata and to save this information as part of the CLD itself. This in turn allows the user to navigate collections based on information stored in the items within the collection which the original producer of the CLD may not have included.
CG performs this function by periodically searching for important information within the items stored in a collection and then caching that information within the collection itself, in a form that can be presented to the end user. In theory this information could also be used as a controlled vocabulary of sorts, to help the user when creating new records within that collection in CG, but this functionality is not within the scope of the CW2 project.
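A minimal sketch of this kind of summarisation is shown below. It assumes, purely for illustration, that the "important information" is the most frequent subject terms across a collection's items and that the result is cached under a `cached_subjects` key; the function names, field names and sample data are hypothetical, and the scheduling of the periodic run is left out.

```python
from collections import Counter


def summarise_items(items, field_name="subject", top_n=3):
    """Return the most common values of one field across item records."""
    counts = Counter()
    for item in items:
        value = item.get(field_name, [])
        # A field may hold a single string or a list of strings.
        values = [value] if isinstance(value, str) else value
        counts.update(values)
    # Sort by frequency, then alphabetically, so the output is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [value for value, _ in ranked[:top_n]]


def refresh_collection_cache(collection, items):
    """Store the summary on the collection record itself."""
    collection["cached_subjects"] = summarise_items(items)
    return collection


items = [
    {"subject": ["coins", "roman"]},
    {"subject": "coins"},
    {"title": "untitled"},
    {"subject": ["medals"]},
]
collection = refresh_collection_cache({"name": "Example collection"}, items)
```

Because the summary is stored on the collection record, the user interface can show it without re-scanning the items on every request.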
Positive lessons from CW2 include that it is possible to link together management information and CLDs in order to provide a more user-friendly search and navigation experience. Neither the CLD specification nor the specification of the management information within CG needed to be adapted for use in this way; it was simply a case of linking the two together and then making the data available as appropriate. The main search APIs provided by CG didn’t need expanding to meet the needs of the CW2 project, although additional API calls were required to expose information such as the application profiles.
The main negative lesson learnt is that once again we underestimated the amount of time required to implement the desired changes to a system as large and complex as CG. The CG platform has evolved incrementally throughout its existence. This means that there are often complex interactions between different parts of the platform, which make changing any one small part of the system more complex than would be the case with a system developed from scratch to fulfil the same requirements. It also means that further incremental changes often conflict with these existing interactions in unforeseen ways, leading to longer development and testing times than forecast. Happily, though, we’ve managed to get through the pain and implement something that we hope will prove useful for all CG users.