Flowing Smoothly into Europeana: the OpenUp! Technical Workflow

Do you know how the content from the collections actually reaches the Europeana portal, where it can then be showcased, for example as one of our content highlights? Today we would like to explain the OpenUp! technical architecture and content workflow.

One of the main tasks in the OpenUp! project is to harvest the standardised metadata of multimedia objects from natural history data providers and to transform this data into the Europeana schema. The transformed data is aggregated in the OpenUp! Metadata Database of the Europeana Natural History Aggregator established by the OpenUp! project and subsequently handed over to Europeana (Berendsohn & Güntsch 2012).

Data or metadata?

We need to explain how we use the terms ‘data’ and ‘metadata’ in the OpenUp! project. Metadata usually refers to the technical data of a multimedia object, e.g. aperture, camera type, etc. In OpenUp!, however, the metadata for multimedia objects also includes natural history domain data, i.e. information about the physical object. Europeana likewise uses the term ‘metadata’ for the domain data (= records) related to the physical object, and features this associated metadata along with the digital object. This metadata is distributed under the CC zero licence in Europeana – under the full control of the provider. Only a minimum set of mandatory concepts is required for Europeana (fig. 1).

The OpenUp! architecture is divided into two integral parts. The first part addresses data provision, including the set-up of the BioCASe Provider Software and the mapping to the domain standard ABCD (Access to Biological Collections Data) and its extension EFG (Extension for Geosciences). The second part is the Europeana Natural History Aggregator, which builds the OpenUp! Metadata Database, carries out the transformation of the domain standard ABCD (EFG) to the Europeana standard ESE (Europeana Semantic Elements) and enables the harvest by Europeana.

The overall OpenUp! to Europeana (technical aggregation) workflow consists of seven major steps that are visualized in the following graphic (p. 19) and described below.

Workflow Description:

Content provider and coordination (Steps 1–3):
The technical set-up for data provision in OpenUp! can also be used, and is already used, to provide data to the GBIF network.

Step 1: Domain standard ABCD and its extension EFG

As the first step, the provider's metadata associated with the multimedia objects (collection data) is mapped to the ABCD domain standard (zoology and botany) and its extension EFG (palaeontology, mineralogy and anthropology). The mapping to the ABCD standard is carried out using the BioCASe Provider Software. The BioCASe Provider Software then also serves as the web interface that exposes the data for harvesting.
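To illustrate the idea behind such a mapping, here is a minimal Python sketch. It is not how the BioCASe Provider Software works internally (there the mapping is configured in the tool itself), and the provider column names (`specimen_id`, `sci_name`, `image_url`) are invented for this example. The concept paths are written in the style of ABCD 2.06 and should be checked against the ABCD documentation before use.

```python
# Illustrative only: hypothetical provider columns mapped to ABCD-style concept paths.
# The column names are assumptions; verify the concept paths against the ABCD documentation.
UNIT = "/DataSets/DataSet/Units/Unit"

COLUMN_TO_ABCD = {
    "specimen_id": f"{UNIT}/UnitID",
    "sci_name": (
        f"{UNIT}/Identifications/Identification/Result/TaxonIdentified"
        "/ScientificName/FullScientificNameString"
    ),
    "image_url": f"{UNIT}/MultiMediaObjects/MultiMediaObject/FileURI",
}


def map_row_to_abcd(row):
    """Return {ABCD concept path: value} for the mapped, non-empty columns of one row."""
    return {
        COLUMN_TO_ABCD[column]: value
        for column, value in row.items()
        if column in COLUMN_TO_ABCD and value not in (None, "")
    }


# One made-up source row; "notes" has no mapping and is therefore ignored.
row = {
    "specimen_id": "EX-0001",
    "sci_name": "Abies alba Mill.",
    "image_url": "http://example.org/images/EX-0001.jpg",
    "notes": "",
}
abcd = map_row_to_abcd(row)
```

Unmapped or empty columns are simply left out, which mirrors the fact that only a subset of the source database needs to be exposed.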

Step 2 (optional): Data quality check

Before the harvest (Step 4), providers can check their data with the Data Quality Toolkit, which offers a service for automated testing of data quality, e.g. checking the conformity of the data or validating scientific names against reference services. After testing the data, providers can apply the necessary changes to their source data or to the mapping between the database and the BioCASe Provider Software.

Step 3: Compliance check and monitoring of data provision

Providers can check their mapping and the correct use of concepts in the BioCASe Monitor Service by attaching their data source access point URL to this URL. Sample values for each concept are displayed and concept values are counted on demand, which helps to detect inconsistencies or incorrect use of concepts with respect to the ABCD documentation.

Furthermore, the BioCASe Monitor provides a compliance check for Europeana and displays error messages if concepts that are mandatory for the ABCD to ESE transformation are missing. Providers should ensure that they have a functional data source and a correct mapping before requesting a test harvest. The OpenUp! Helpdesk provides documentation and technical assistance for the set-up of the BioCASe Provider Software (BPS) and the ABCD (EFG) mapping, and assists the providers in troubleshooting, in close collaboration with the BioCASe Helpdesk and the GBIF team.
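The principle of such a compliance check can be sketched in a few lines of Python. This is not the BioCASe Monitor's actual implementation, and the list of mandatory concepts below is a placeholder assumption; the real minimum set is the one defined for the ABCD to ESE transformation.

```python
# Illustrative only: the concept names in MANDATORY_CONCEPTS are assumed placeholders,
# not the actual list of concepts required for the ABCD to ESE transformation.
MANDATORY_CONCEPTS = ["UnitID", "FullScientificNameString", "FileURI"]


def missing_mandatory(record, mandatory=MANDATORY_CONCEPTS):
    """Return the mandatory concepts that are absent or blank in one record."""
    missing = []
    for concept in mandatory:
        value = record.get(concept)
        if value is None or not str(value).strip():
            missing.append(concept)
    return missing


def compliance_report(records):
    """Map each record's position to its missing concepts; compliant records are omitted."""
    report = {}
    for index, record in enumerate(records):
        missing = missing_mandatory(record)
        if missing:
            report[index] = missing
    return report


records = [
    {"UnitID": "EX-0001", "FullScientificNameString": "Abies alba Mill.",
     "FileURI": "http://example.org/images/EX-0001.jpg"},
    {"UnitID": "EX-0002", "FullScientificNameString": "  ", "FileURI": None},
]
report = compliance_report(records)
```

An empty report would mean that every record carries all of the assumed mandatory concepts.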

Progress in content provision is monitored in the BioCASe Monitor Service by the coordination teams of the content-providing Work Packages 4 and 5 of OpenUp!.

Step 4: (Test) Harvest

Once the mapping has been quality checked by the coordination teams of the content-providing Work Packages and the OpenUp! Helpdesk, a test harvest is initiated with the GBIF harvester, the HIT (Harvesting and Indexing Toolkit). Test results and valid content are communicated back to the provider to allow for further adjustments. Technical problems encountered during the test harvest are fixed in collaboration with AIT, the OpenUp! Helpdesk and the BioCASe Helpdesk team. A harvest of the entire data source is initiated after successful completion of the test harvest and confirmation by the provider.

Data providers can check how their content is displayed in Europeana with the Europeana Content Checker tool. This tool is also used by the WP coordination for a final quality check and to detect issues in the display of data and content in Europeana. Display problems in the Europeana portal that are not related to the data provided are communicated back to Europeana.

Step 5: HIT Harvest

The HIT harvester stores batches of ABCD (EFG) records in the central aggregator's OpenUp! Metadata Database. This database stores only the metadata, including the URLs of the multimedia objects, not the multimedia files themselves.
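As a rough illustration of "metadata only, plus URLs", the following Python sketch stores a batch of harvested records in an SQLite table. The table layout, the field names and the use of SQLite are assumptions made for this example; they do not describe the actual schema of the OpenUp! Metadata Database or the internals of the HIT.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a small metadata store; the schema is an assumption for this sketch."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS abcd_record (
            provider  TEXT NOT NULL,
            unit_id   TEXT NOT NULL,
            media_url TEXT,            -- only the URL is kept, never the media file itself
            abcd_xml  TEXT NOT NULL,
            PRIMARY KEY (provider, unit_id)
        )
        """
    )
    return conn


def store_batch(conn, provider, records):
    """Insert or update one batch of records and return the provider's record count."""
    with conn:  # commits on success, rolls back if an insert fails
        conn.executemany(
            "INSERT OR REPLACE INTO abcd_record (provider, unit_id, media_url, abcd_xml) "
            "VALUES (?, ?, ?, ?)",
            [(provider, r["unit_id"], r.get("media_url"), r["abcd_xml"]) for r in records],
        )
    row = conn.execute(
        "SELECT COUNT(*) FROM abcd_record WHERE provider = ?", (provider,)
    ).fetchone()
    return row[0]


conn = open_store()
batch = [
    {"unit_id": "EX-0001", "media_url": "http://example.org/images/EX-0001.jpg",
     "abcd_xml": "<Unit>...</Unit>"},
    {"unit_id": "EX-0002", "abcd_xml": "<Unit>...</Unit>"},
]
count = store_batch(conn, "example-provider", batch)
```

Keying on provider and unit identifier means that a repeated harvest updates existing records instead of duplicating them.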

Step 6: ABCD (EFG) transformation to ESE

The metadata in the ABCD (EFG) standard used by the natural history domain are transformed into ESE, which is used as a cross-domain metadata standard in Europeana. The transformation is carried out using Pentaho Data Integration (Pentaho Kettle). The mapping tool picks up the metadata, transforms them and stores them in a metadata database.
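The kind of field-by-field transformation involved can be sketched in Python. In OpenUp! the transformation is carried out in Pentaho Data Integration, not in Python, and the pairing of ABCD concepts with ESE elements below is an assumption chosen for illustration, not the project's actual mapping.

```python
# Illustrative only: this ABCD-to-ESE pairing is an assumed example mapping.
UNIT = "/DataSets/DataSet/Units/Unit"

ABCD_TO_ESE = {
    f"{UNIT}/UnitID": "dc:identifier",
    (
        f"{UNIT}/Identifications/Identification/Result/TaxonIdentified"
        "/ScientificName/FullScientificNameString"
    ): "dc:title",
    f"{UNIT}/MultiMediaObjects/MultiMediaObject/FileURI": "europeana:isShownBy",
}


def abcd_to_ese(abcd, data_provider, media_type="IMAGE"):
    """Flatten one ABCD record ({concept path: value}) into a flat ESE-style dict."""
    ese = {
        ese_field: abcd[path]
        for path, ese_field in ABCD_TO_ESE.items()
        if path in abcd
    }
    # Values that do not come from the ABCD record itself are supplied by the caller.
    ese["europeana:dataProvider"] = data_provider
    ese["europeana:type"] = media_type
    return ese


abcd_record = {
    f"{UNIT}/UnitID": "EX-0001",
    f"{UNIT}/MultiMediaObjects/MultiMediaObject/FileURI":
        "http://example.org/images/EX-0001.jpg",
}
ese_record = abcd_to_ese(abcd_record, data_provider="Example Museum")
```

The result is a flat set of element-value pairs, which reflects the flat structure of ESE compared with the deeply nested ABCD schema.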

Step 7: OAI-PMH and Europeana harvest

The metadata are periodically harvested by Europeana via a single OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) access point at the metadata database. Previews of the multimedia objects, used for presentation and in query results in the Europeana portal, are generated by Europeana from the full object URLs given in the metadata. This is the final step in the workflow when providing data in the flat ESE standard.
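For readers unfamiliar with OAI-PMH, the following Python sketch shows how a harvester could build `ListRecords` requests and follow resumption tokens. The verb, the `metadataPrefix` and `resumptionToken` arguments and the XML namespace come from the OAI-PMH 2.0 specification; the base URL and the metadata prefix `ese` are assumptions, and no network request is made here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, metadata_prefix=None, resumption_token=None):
    """Build a ListRecords request URL.

    In OAI-PMH, resumptionToken is an exclusive argument: when it is present,
    metadataPrefix must not be sent again.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def next_token(response_xml):
    """Return the resumption token of a response, or None when the list is complete."""
    root = ET.fromstring(response_xml)
    token = root.find(f".//{OAI_NS}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()


# Hypothetical access point and prefix, used only to show the request shape.
first_url = list_records_url("http://example.org/oai", metadata_prefix="ese")

sample_response = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    "<ListRecords><resumptionToken>abc123</resumptionToken></ListRecords>"
    "</OAI-PMH>"
)
follow_up_url = list_records_url(
    "http://example.org/oai", resumption_token=next_token(sample_response)
)
```

A harvester would repeat the follow-up request until `next_token` returns `None`, which signals that the last batch has been delivered.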

This is the technical workflow as currently implemented. We will publish an updated workflow, including the semantic enrichment and EDM transformation, in upcoming newsletters and on our blog. Stay tuned!

Berendsohn W. G., Güntsch A. (2012): OpenUp! Creating a cross-domain pipeline for natural history data. In: Blagoderov V., Smith V. S. (Eds) No specimen left behind: mass digitization of natural history collections. ZooKeys 209: 47–54.
