Advancing the Catalogue of the World’s Natural History Collections

Reposting: Advancing the Catalogue of the World’s Natural History Collections from GBIF under the auspices of the Alliance for Biodiversity Knowledge and the SYNTHESYS+ project.
Release date: 2020-02-24 v1.0

Citation: Hobern D, Asase A, Groom Q, Paul D, Robertson T, Semal P, Thiers B & Woodburn M (2020) Advancing the Catalogue of the World’s Natural History Collections. v1.0. GBIF Secretariat: Copenhagen. https://doi.org/10.15468/doc-wnsx-ep77.

Contributors

Donald Hobern, Catalogue of Life | International Barcode of Life
Alex Asase, University of Ghana | GBIF Ghana
Quentin Groom, Meise Botanic Garden
Deborah Paul, iDigBio | TDWG CD Interest Group
Tim Robertson, GBIF Secretariat
Patrick Semal, Royal Belgian Institute of Natural Sciences | CETAF
Barbara Thiers, New York Botanical Garden | Index Herbariorum
Matt Woodburn, Natural History Museum, London | TDWG CD Interest Group

Additional contributors to subsequent versions will be credited here.

Licence

The document Advancing the Catalogue of the World’s Natural History Collections is licensed under a Creative Commons Attribution 4.0 International License.

Persistent URI

https://doi.org/10.15468/doc-wnsx-ep77

Document control

v1.0, 24 February 2020

Cover image

Stenoptilia pterodactyla, collected by Aare Lindt in Kübassaare, Estonia, 2010. Photo 2016 PlutoF via Estonian Natural History Museum, licensed under CC BY-NC 4.0.

Background

Information about natural history collections helps to map the research landscape and assists researchers in locating and contacting the holders of specimens. Collection records contribute to the development of a fully interlinked biodiversity knowledge graph, showcasing the existence and importance of museums and herbaria and supplying context to available data on specimens. These records also potentially open new avenues for fresh use of these collections and for accelerating their full availability online.

This document explores ideas for improved global collaboration to build, maintain and use a comprehensive catalogue of the world’s natural history collections. Each idea is presented as a separate topic with a set of questions to guide discussion within the online consultation, Advancing the Catalogue of the World’s Natural History Collections.

Over the last few decades, the field of biodiversity informatics has developed to include researchers and informaticians from all over the world, collaborating to bring together knowledge of the world’s species and ecosystems in a readily usable form.

The focus of biodiversity informatics has largely been on species and other taxa (including their names, diagnostic characters and traits), natural history specimens (including information on their collection in the field, their measurements, images, sequences, etc.), and field observations (including information on occurrence, distribution and abundance surveys, monitoring activities, citizen science, genomics and many other sources). These elements together help to address two fundamental challenges in biology: characterising the set of species with which we share the planet, and understanding the changing distribution, co-occurrence, interactions, and dynamics of these species in space and time.

The biodiversity informatics community has also given attention to other categories of information that support these primary elements, especially through efforts to digitise the vast literature on taxonomy and biodiversity and work to develop a comprehensive catalogue of the world’s natural history collections, including museums, herbaria and a range of specialised collections.

These collections are the repository for materials from centuries of international investment to collect, document, study and describe species. Specimens and other materials held in these collections anchor our understanding of evolution and contemporary diversity. They provide the bridge between historical knowledge and continuing efforts to describe life on Earth. Many of their holdings are truly irreplaceable or give otherwise irrecoverable insights into past distributions and ecology. Information on the collections themselves is an important tool for accessing, enriching and using them.

Many established use cases for standardised collection information relate primarily to preserved biological collections. This paper treats these collections as its core focus. However, we hope that the consultation will also explore two other closely related contexts: 1) geological collections (often held and managed by the same institutions as biological collections) and 2) living collections (overlapping significantly with the subject matter and research uses of preserved biological collections). We welcome inputs that address this wider scope.

How to respond to this Ideas Paper

Read the sections below and contribute to developing a roadmap for collaborative activity to build the catalogue.

We welcome contributions as follows:

Do you represent a stakeholder, project, database, tool, standard or community that addresses some aspect of the topics outlined here, or do you have ideas for novel approaches to mobilise or use information on collections?
- Please contact Donald Hobern by 3 April 2020 to contribute significant ideas or examples that will add value to the online discussions.
- We welcome short documents or slide presentations that can be shared on the consultation website.
- If presentations are unlikely to be clear without further explanation, please modify the slides or consider supplying the presentation as a pre-recorded video with audio commentary.
- Please keep all materials brief and focused, so that a reader or viewer could assimilate the ideas within fifteen minutes or (ideally) less.
Would you like to understand more about the consultation or to suggest possible additions to the scope outlined here?
- Please register for one of the preparatory webinars (06:00 UTC on 12 March 2020 or 15:00 UTC on 13 March 2020).
Would you like to contribute to the online discussions for the consultation?
- Please register to join the consultation community on the GBIF Discourse site.
- Discussions will take place between 17 and 29 April 2020.
- We will keep you informed as more information is added to the site and ensure that you receive regular updates during the consultation.
Will you be able to expand the relevance of the consultation by translating short summary updates into languages other than English?
- We expect to circulate regular short summaries (a few paragraphs every day or two) to all participants during the main consultation period to keep the discussion focused, summarise agreement, and highlight new ideas and questions.
- We welcome assistance in translating these into languages that will make it easier for all participants to follow the discussions and know how to contribute.
- Please contact Donald Hobern if you are interested in helping.

1. Uses for the catalogue

The TDWG Collection Description Interest Group has collected use cases for natural history collection information from several major stakeholders.

1.1. A directory to support the collections community

Collections staff and taxonomists collaborate as a truly global community. Valuable specimens are distributed between institutions in all parts of the world. Researchers visit these collections or borrow specimens as part of their work. Index Herbariorum (IH) is the directory of information on the world’s herbaria (addresses, contacts, specialties, size, etc.). It is a well-managed resource and highly regarded as a tool by the botanical community. No full equivalent exists globally for other natural history collections, although national/regional infrastructures such as the ALA collections pages, the iDigBio US Collections List, and the CETAF profiles serve similar roles. GBIF has recently integrated the Global Registry of Scientific Collections (GRSciColl) into its registry as a framework that can be extended with richer information curated by collections communities.

Q1. Would the collections community benefit from a comprehensive directory of all natural history collections? Who would make use of such a directory? (The focus here is on the catalogue as a directory of known institutions and information required to contact them.)

1.2. Locating specimens and genetic materials

Taxonomic studies and other research projects normally depend on researchers (or their contacts) knowing which institutions hold relevant specimens or other materials. This is complicated by the history of expeditions and collecting activities. Specimens have been scattered across all continents. Only a small proportion of these specimens have been databased in forms that can be accessed through GBIF or other portals. A catalogue providing at least summary information on taxonomic and geographic scope for each collection could assist researchers in locating relevant materials.

Q2. Would summary information on every collection’s materials be a useful tool? Who would use this information? What is the minimum level of information (and what is ideal) to support these users?

1.3. A first step towards databasing collections

The information needed to build the catalogue of collections closely matches the metadata required to publish a specimen dataset to GBIF and other portals. A collection record could be treated as a minimal first step, perhaps leading through processes such as Join The Dots and onwards to comprehensive digitisation. A comprehensive catalogue of such records could guide efforts to prioritise further digitisation, by highlighting collections with holdings of particular relevance or by assisting the development of collaborative digitisation networks like the ADBC Thematic Collection Networks.

Q3. Can publishing a collection record to a catalogue assist collections in moving towards full digitisation? What incentives or support do collections need to make this a worthwhile step?

1.4. Assessing the scale and value of collections

Estimates of the number of specimens held by collections run into billions, but no definitive number exists. A catalogue could help to narrow these estimates and to assess the economic value of these irreplaceable holdings. This information may help to justify the scale of effort and funding needed to digitise collections and make their data accessible for universal and persistent use.

Q4. Would more accurate estimates of the scale and value of collections be useful? How might these be used and by whom?

1.5. Increased value for data on specimens, taxonomic publications, etc.

Accurate information on any collection can be used as a reference or as linked data associated with specimen records and other data objects. Users of specimen records need contextual information about the collection that holds the specimen, for example to communicate with collection managers about individual specimens, to offer corrections to specimen data, or simply to determine whether the collection is likely to hold quantities of similar specimens. It may be inefficient to embed all of this information within the specimen record. Holding a single authoritative copy assists with keeping the collection information current. The collection record may also contain information on taxonomic or geographic scope or other aspects that can resolve potential ambiguities within a specimen record. Links to current collection records will also enhance taxonomic publications referencing their materials. This is particularly important because catalogue numbers and other specimen identifiers used in publications may not link to digitised information on the specimens. Linking to the collection simplifies future access and may enable digital links to be inferred in future.

Q5. How could a comprehensive collections catalogue contribute to improvements to other categories of biodiversity data? What requirements would these improvements place on the catalogue?

1.6. Reducing duplication of effort

Although no complete catalogue of collections exists, the need for such information leads to the same data being published repeatedly in different formats for different portals, project documentation, metadata for other data, etc. This duplication results in confusion as outdated information remains on the web. Mechanisms that always link to a single continuously updated version would address these issues.

Q6. Can we identify savings in time and costs that would arise from a well-managed shared catalogue of collections?

1.7. Foundation for new and enriched services

A comprehensive directory could serve as a foundation for new tools that enhance taxonomic efforts and cooperation between all collections. One example might be the development of distributed loans systems or on-demand digitisation, as planned for the DiSSCo European Loans and Visits System (ELViS). A catalogue could also serve as a showcase for institutions to highlight their holdings and unique features, as in the visual concept shared by GBIF for collection pages. GBIF tracking and reporting on the use of biodiversity data in research publications could feed into new services that provide standard metrics and help collections to measure and report their impact.

Q7. What other services could be developed on the foundations of a collection catalogue? Would these attract investment to fund the development and support the maintenance of the catalogue?

2. Information in the catalogue

We need to develop a shared vision for the content that the catalogue should hold and how it interlinks with other information products.

2.1. Definition of “natural history collection”

The scope for the catalogue needs to be defined. This involves agreeing what should be counted as a natural history collection and what should be excluded. Practices in this regard vary across the community. Within IH, each herbarium record usually corresponds to an institution with its own unique collection code, street address, etc. Within zoology, museums are often structured as a set of collections with differing and possibly hierarchical taxonomic scope. Specimens collected on famous expeditions or by significant researchers may have their own identity. Other categories that may require consideration include living collections (microbial collections, zoos, aquaria, botanic gardens, seed banks), specially managed collections (e.g. tissue collections, DNA repositories, slide banks, xylaria), and university and personal collections.

Q8. What is the definition for our purposes (minimal and sufficient criteria) of a natural history collection? How do collections relate to and differ from 1) institutions, 2) datasets and 3) collecting events (e.g. expeditions)? Are collections (and collection records) hierarchical? If so, how do parent-child relationships work, and do we infer information from parent to child or vice versa? What identifier schemes (IH collection codes, GRSciColl URIs, etc.) already exist and need to be maintained in some form? Do these schemes follow a consistent definition of a natural history collection?
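To make the parent-child question concrete, the following minimal Python sketch shows one possible convention: a child collection record inherits any value it does not define itself from its nearest ancestor. The class, the field names and the example codes are hypothetical illustrations, not part of IH, GRSciColl or any existing standard, and inferring from child to parent would need a different rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CollectionNode:
    """Hypothetical collection record that may point to a parent record."""
    code: str
    parent: Optional["CollectionNode"] = None
    address: Optional[str] = None
    taxonomic_scope: Optional[str] = None

    def resolve(self, field_name: str) -> Optional[str]:
        """Return the nearest non-empty value, walking up towards the root."""
        node = self
        while node is not None:
            value = getattr(node, field_name)
            if value is not None:
                return value
            node = node.parent
        return None


# Invented example: a museum with an entomology sub-collection.
museum = CollectionNode(code="XMUS", address="1 Example Street",
                        taxonomic_scope="Animalia")
entomology = CollectionNode(code="XMUS-ENT", parent=museum,
                            taxonomic_scope="Insecta")

print(entomology.resolve("address"))          # inherited from the parent
print(entomology.resolve("taxonomic_scope"))  # the child's own value
```

Under this convention the sub-collection keeps its own taxonomic scope but falls back to the museum's address, which is one way of avoiding repeated contact details across a hierarchy.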

2.2. Description of a collection

The TDWG Collection Descriptions (CD) Interest Group is currently developing the CD standard for collection descriptions (evolving from the earlier TDWG Natural Collections Description (NCD) standard). Existing networks and institutional schemes use a variety of different formats or variants of metadata standards for their collection records, as a result of which interoperability between these resources is limited. To overcome this barrier, clarity is needed around factors such as preferred standards and vocabularies, mandatory fields and compatibility between information in different formats.

Q9. What descriptive information should be considered mandatory or desirable for each Collection? Does the TDWG CD work supply everything needed? Otherwise, what enhancements are necessary? How much of this information needs to be normalised for machine processing (rather than just for human readers)?
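As a starting point for discussion, the distinction between mandatory and desirable fields could be sketched as below. Every field name and the choice of which fields are mandatory are assumptions made for illustration only; they are not taken from the TDWG CD or NCD work.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CollectionRecord:
    # Assumed mandatory: the minimum needed to identify and contact a collection.
    identifier: str
    name: str
    institution: str
    contact_email: str
    # Assumed desirable: summary information that helps locate material.
    taxonomic_scope: List[str] = field(default_factory=list)
    geographic_scope: List[str] = field(default_factory=list)
    estimated_specimen_count: Optional[int] = None


# Hypothetical list of fields that must not be left blank.
MANDATORY = ("identifier", "name", "institution", "contact_email")


def missing_mandatory(record: CollectionRecord) -> List[str]:
    """Return the names of mandatory fields that are empty or whitespace."""
    return [name for name in MANDATORY if not getattr(record, name).strip()]


record = CollectionRecord(
    identifier="example:0001",
    name="Example Herbarium",
    institution="Example University",
    contact_email="",
    taxonomic_scope=["Plantae"],
)
print(missing_mandatory(record))
```

Lists of controlled terms for scope, rather than free text, are one way to make this information usable by machines as well as human readers.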

2.3. Wider data linkages

Information in the collection catalogue may be linked to a wide range of other biodiversity information (specimens, sequences, datasets, images, publications, etc.) to support information access and exploration.

Q10. What information should be linked to collection records? We should focus on making linkages that will actually justify the costs of creating and maintaining them. The following are likely to be candidates, but others are possible. In each case, we should determine whether the linkage needs to be bidirectional:

Specimens held by a collection
Type specimens held by a collection
Species/taxa represented in a collection (with/without specimen counts)
Sequences, images and other preparations from the collection (but these may be better treated as information about specimens rather than about the collection)
Datasets (checklists, occurrences, sampling events) associated with the collection
Collecting expeditions carried out by or contributing to the collection (modelled as sampling events?)
Collectors associated with a collection
Publications based on materials from the collection
Researchers/staff associated with the collection
Field notebooks

2.4. Information services relating to collections

The main value from the collection catalogue may appear in the information services that can be offered around the information managed. Considering these services may help to clarify the content requirements.

Q11. What do we want to do with the catalogue, beyond having clean and comprehensive linked open data about each collection? The following potential services are likely to be candidates, but others are possible. In each case, would the service depend on a partnership with other digital repositories (e.g. BHL, GBIF, CoL)?

Assess the growth, scale and value of the world’s collections
Discover the location of biological materials or the likely presence of biological materials for any taxon
Develop discovery services for accessing information on type specimens or communicating with the relevant collection where the specimen is not digitised
Identify sections of collections that should be digitised to answer specific questions
Match gap analysis of published specimen data against the collection catalogue to prioritise digitisation for filling taxonomic, geographic, or other gaps
Discover holdings that make a particular collection unique, and therefore of even higher value
Develop and fund collaborative digitisation programmes focused on understanding of the holdings of the network as a whole
Develop cross-institutional loan systems and taxonomic workbenches
Develop citation models for collections and track their impact
Perform risk assessment of the health or stability of a collection
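The gap-analysis service in the list above can be illustrated with a small sketch. It assumes, hypothetically, that the catalogue records the taxa each collection declares it holds and that a portal can report the taxa for which specimen records have been published; the function name, collection codes and data are invented for illustration.

```python
from typing import Dict, Set


def digitisation_gaps(declared: Dict[str, Set[str]],
                      published: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each collection, list declared taxa with no published specimen data."""
    gaps = {}
    for collection, taxa in declared.items():
        missing = taxa - published.get(collection, set())
        if missing:
            gaps[collection] = missing
    return gaps


# Invented example data: taxa declared in the catalogue vs. taxa with
# published specimen records.
declared = {"COLL-A": {"Lepidoptera", "Coleoptera"}, "COLL-B": {"Poaceae"}}
published = {"COLL-A": {"Lepidoptera"}, "COLL-B": {"Poaceae"}}

gaps = digitisation_gaps(declared, published)
print(gaps)
```

A real service would need agreed taxonomic and geographic vocabularies on both sides before such a comparison is meaningful.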

3. Technology for the catalogue

A wide range of different tools are already in use for authoring collection metadata and curating partial catalogues such as IH, GRSciColl and national collections pages. These vary in their technical capabilities and sustainability. Some are well supported by existing communities and could form part of an interconnected solution. A goal for this consultation is to identify which components are mature and stable and can contribute to such a solution and to identify what other components may need to be developed.

3.1. Pathways and tools for publishing collection records

Existing information on collections is edited and maintained in different ways. IH allows herbaria to provide or edit their records and offers support for herbaria to provide updates via email or other channels. Other communities such as national portals have other pathways for collections to provide or update information. Several tools help data publishers to create EML metadata for publishing data to GBIF and elsewhere. These could evolve to deliver collection records in preferred formats. The Integrated Publishing Toolkit (IPT) could be enhanced to offer collection records as one of the core record types that can be shared. This would allow collections either to publish one or more collection records as a small standalone dataset or collection networks to manage and publish a dataset comprising many collection records. Wikidata could also serve as a tool or platform for editing catalogue information and making it widely accessible and reusable.

Q12. Which existing tools, databases and websites can help to mobilise and maintain collection records? Is it possible to identify additional tools or pathways that need to be developed or supported?

3.2. Community catalogues

IH is the best established catalogue servicing a large community of collections, but many other communities are important, including regionally or nationally focused efforts, such as CETAF’s institutional profiles, the web portals of iDigBio and the ALA, and the One World Collection initiative, and thematically aligned efforts, such as the World Directory of Culture Collections and the Global Genome Biodiversity Network portal. A comprehensive global catalogue should ensure that the needs of these different communities are met and support their continued operation and independence wherever this is valued by collections. Understanding these requirements is essential in planning the technical implementation and governance of the catalogue.

Q13. What catalogues already address the needs of some communities of collections? How can an integrated catalogue support these communities? Which communities require a separately branded identity and/or platform? What is the best way to include these communities as part of an interconnected solution? Is there a role for content to be created and improved by a wider audience (e.g. through Wikidata)?

3.3. Integrated catalogue

GBIF has the mission to provide global-scale support for biodiversity informatics solutions and has expanded its Registry to host the data historically maintained as GRSciColl. GRSciColl content is incomplete and is best seen as a framework for expansion with richer collection metadata that properly represents the needs and interests of collections. GBIF can serve as the context for integration and deduplication of collection information from different sources and for interlinking this information with other biodiversity data. GBIF requires guidance on the best way to support the needs and branding of collections and their communities as it develops such services.

Q14. Are there issues with GBIF providing hosting and support for the global catalogue through its Registry? What is required to ensure that this meets the needs of collections and is fully adopted and owned by the collections community? What challenges need to be addressed to minimise duplication of content and effort within an integrated catalogue?

3.4. Collection management systems

Most natural history collections maintain data on their specimens in a collection management system (CMS) such as Specify, Symbiota, EMu, DarWIN or BRAHMS. Some of these tools could develop to interface directly with the collection catalogue, providing up-to-date metadata and metrics.

Q15. What present or future requirements are there for interfaces directly between CMS platforms and the collection catalogue? Are there special opportunities that should be considered? Could CMS platforms become a source of metadata for institutional collections within a global catalogue?

3.5. Interfaces, APIs and client modules

The value of a shared commons-based resource can be maximised by ensuring that interfaces and APIs support the needs of all key stakeholder groups, including addressing issues around content delivery to the fullest extent possible in multiple languages. Some needs may be addressed by offering reusable client components that can be embedded in other applications.

Q16. What interfaces and APIs are required to maximise access to the collection catalogue? How can the catalogue best support diverse user communities, including speakers of different languages?

4. Governance of the catalogue

Standards and tools are only one part of the solution. For the catalogue to succeed and provide value, it must be accepted by and deliver value to the stakeholders it represents, in particular the collection-holding institutions and the communities that support collections. It is important to identify the stakeholders that need ownership for each aspect of the collection building and to understand how they can be enabled, empowered, and resourced to take on these responsibilities. Mechanisms are also needed to deal with situations in which needs or interests may come into conflict.

4.1. Ownership of information for each collection

The basic assumption is that each institution should have primary responsibility and control for information on its collections. However, it may be appropriate to delegate full or partial responsibility to thematic, regional or national communities that have data curators able to ensure the quality and standardisation of collection records. In some contexts, where institutions have for any reason not provided authoritative information, or do not have the resources to do so, there may be reason to allow or encourage a wider user base to contribute and improve collection records. In all cases, a version history is required for the information, so that users can understand and respond to changes made by others.

Q17. How should ownership and access control for collection records be managed? How should appropriate editors be recognised and validated? Are there situations where automated or human intervention will be required to resolve disagreements or discrepancies?
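The version-history requirement described above could be met in many ways. The sketch below shows one simple option, an append-only list of revisions from which the current state and the history of any field can be derived. The class names, editor labels and fields are hypothetical and do not describe how any existing registry works.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Revision:
    """One edit: who made it and which fields they set."""
    editor: str
    changes: Dict[str, str]


@dataclass
class VersionedRecord:
    """Hypothetical collection record stored as an append-only edit log."""
    revisions: List[Revision] = field(default_factory=list)

    def edit(self, editor: str, **changes: str) -> None:
        self.revisions.append(Revision(editor, dict(changes)))

    def current(self) -> Dict[str, str]:
        """Replay all revisions in order; later edits override earlier ones."""
        state: Dict[str, str] = {}
        for revision in self.revisions:
            state.update(revision.changes)
        return state

    def history(self, field_name: str) -> List[Tuple[str, str]]:
        """Return (editor, value) pairs for every change to one field."""
        return [(r.editor, r.changes[field_name])
                for r in self.revisions if field_name in r.changes]


record = VersionedRecord()
record.edit("institution", name="Example Herbarium", contact="old@example.org")
record.edit("regional-curator", contact="new@example.org")

print(record.current())
print(record.history("contact"))
```

Because nothing is overwritten, an institution can see exactly what a delegated curator or wider contributor changed and can respond by adding a further revision.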

4.2. Communities of practice

Communities such as IH, CETAF, ALA and iDigBio play an important role supporting collections and promoting standards-based practices. In many cases, these communities have a high level of understanding and participate closely in the development of biodiversity informatics solutions. Their roles and rights need to be well defined and supported in any integrated solution.

Q18. What do these communities require to be able to carry out their work efficiently and support their collections? How can an integrated approach enhance their offerings? What risks need to be addressed?

4.3. Technical infrastructures

Biodiversity information infrastructures such as GBIF, DiSSCo, iDigBio and other national and regional platforms are usually funded in the context of broader open science goals for research infrastructures. Their participation can provide an important bridge between the needs of the collection communities and funding and expertise for informatics solutions. Roles and responsibilities must however be well defined to ensure that the needs of researchers and user communities are central. It is important to define clearly how these technical infrastructures can best participate in the overall solution, including demonstrating the benefits required to secure sustained funding for an integrated catalogue and for all the component parts.

Q19. What technical infrastructures need to be engaged as part of the solution? How are their roles and needs best balanced with those of the collections and of their communities?

4.4. Governance arrangements

A complex, commons-based solution will depend for its long-term success on a governance model that provides confidence to all parties that their interests are served and protected. The model should find the right balance between ensuring the health of the collaboration and minimising associated overheads in terms of meetings, reporting, etc.

Q20. Are there appropriate models that can be adopted or expanded to support the governance of this catalogue? Can it be managed in the context of an existing organisation or institution?

4.5. Incentives for contributors

Relatively little effort may be required for each institution to register and manage its own collection records. However, the stability of the system will depend on continued effort from these institutions or from other parties to correct errors and outdated information. There should be clear benefits or incentives encouraging stakeholders to contribute this effort. A key goal should be to ensure that the catalogue contributes usefully to the work of collection managers and taxonomists. Acknowledgement of contributions may also be valuable.

Q21. What are the incentives for different contributors to maintain information in the catalogue? How can these be maximised?

4.6. Funding

Funding needs will depend on other aspects of the approach adopted to build the catalogue. Costs will be higher if more central support is required to maintain the content. Even if the content is largely managed for free by the international community, sustaining a reliable infrastructure requires effort and long-term investment (see for example the CoreTrustSeal model for trusted repositories).

Q22. How can the governance and technical aspects be funded? Is external funding likely? What other models may be feasible (contributions from collections, inclusion within the funded mission for GBIF or some other host)?