From iDigBio - About the Cisco Pit Stop: Digitising the Natural History Museum’s collections

For optimum results, digitization of collections needs to go faster, right? Of course, going faster still means addressing data quality and completeness. Companies periodically review the hardware and software they use because technological advances offer them the opportunity to do more for less money, save time, reach more people, and improve and connect data in new ways (see, for example, Moore's law: https://en.wikipedia.org/wiki/Moore's_law).

The same is true for collections digitization and publishing. We need a periodic review of what we're doing, and we need new insights and new ideas from new collaborators. The physical objects themselves may change slowly through geologic time, but collecting practices, preservation methods, data standards, and data publishing and sharing methods can evolve at a much faster rate. What activities and models might speed up the discovery and dissemination of new methods for collections digitization and publishing?

A Possible Example - a CISCO Pit Stop.
The Natural History Museum, London (NHM) is attempting to digitize 20 million (M) specimens in 5 years as part of its Digital Collections Programme (see also http://www.nhm.ac.uk/our-science/our-work/digital-museum.html). At its current digitization (data capture and imaging) rate, it would take about 100 years to capture 20 M specimens' worth of data. So, what to do? One, starting with a known rate of image/data capture, the NHM knows it must capture at least 2000 images/day to meet its 5-year goal. Two, a skills assessment of staff (researchers, curators, collection managers and technical assistants) revealed what courses and content might be offered to speed up human data and image capture and improve data quality at the same time. For at least some of this capacity building, the NHM is using Data Carpentry course materials and workshops.
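
The rate arithmetic behind those figures can be sketched in a few lines of Python. This is only a back-of-envelope illustration: the function names, the 250 working days per year, and the `specimens_per_image` parameter are assumptions of this sketch, not figures from the NHM.

```python
def speedup_needed(years_at_current_rate, target_years):
    """How many times faster digitization must go to hit the target."""
    if years_at_current_rate <= 0 or target_years <= 0:
        raise ValueError("durations must be positive")
    return years_at_current_rate / target_years


def required_images_per_day(total_specimens, target_years,
                            working_days_per_year=250,
                            specimens_per_image=1.0):
    """Images per working day needed to cover every specimen in time.

    working_days_per_year and specimens_per_image are assumptions of this
    sketch; the post does not state either value.
    """
    if min(target_years, working_days_per_year, specimens_per_image) <= 0:
        raise ValueError("all inputs must be positive")
    total_days = target_years * working_days_per_year
    return total_specimens / (specimens_per_image * total_days)


if __name__ == "__main__":
    print(speedup_needed(100, 5))
    print(required_images_per_day(20_000_000, 5))
    print(required_images_per_day(20_000_000, 5, specimens_per_image=8))
```

Going from about 100 years to 5 years means roughly a 20-fold speed-up. The images-per-day result is very sensitive to the assumed inputs, especially how many specimens a single image covers, so the numbers this sketch prints should not be read as the NHM's own calculation.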

And three, the NHM found a way to look for innovation on key digitization challenges through the eyes of some potential new collaborators. Working in concert with CISCO, the NHM partnered with the UK Digital Catapult Centre (DCC) to invite small and medium-sized enterprises (SMEs) to a 2-day Pit Stop meeting to learn about the NHM’s 5-year, 20 M specimen challenge. On 25-26 February, 67 people gathered at the NHM and the DCC for 2 days of brainstorming on how to speed up collections digitization. Day 1 began at the NHM with talks from CISCO, the NHM's Vince Smith, and Rod Page, followed by behind-the-scenes collection tours from Vladimir Blagoderov in the Sackler Biodiversity Imaging Lab and Sandra Knapp in the Herbarium. Considering that the museum collections and digitization worlds were completely new to many of these organizations, I was thrilled to be asked to start day 2 with an overview of some of the specific challenges facing collections trying to digitize.

One particularly nifty (yes, nifty!) part of this event was an artist, John McKeever (@GoodStick on Twitter), whose job it is to animate the story from a person’s talk. John makes a PowerPoint come alive as an illustrated narrative. How cool is that? You can literally see what I talked about.

SMEs were selected from those who answered a Call for Participation (CFP) put together by the DCC, NHM and CISCO. They were invited to the CISCO Pit Stop based on how well they matched the NHM's 3 main challenges. It is interesting (but perhaps not surprising) to note that, so far, most of the SMEs answering the CFP found the data / text mining tasks most intriguing and most addressable with the skills and tools they bring to our digitization data capture, mobilization, and imaging challenges. As described on the DCC website, the Pit Stop focused on three key challenges:

  1. Specimen metadata transcription: exploring efficient ways of capturing text at scale. ...Potential solutions to improve efficiency could include OCR for printed tags and HTR (Handwriting Transcript Recognition). NHM manually digitised records will be available if required for training data.
  2. Data quality enhancement of legacy data and output from any transcription processes: looking at data cleaning methods and how these can be automated. This could incorporate NLP/machine learning/text processing for analysis; it might also be addressed by using gamification/crowd sourcing over time.
  3. Text/data/literature mining and linking: enhancing the value of specimen data by extracting the semantic information and linking to relevant data be it in published literature archives or online and provide new metrics to track the impact and use of digital collections.
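
As a small illustration of the second challenge (automated cleaning of legacy data), here is a minimal Python sketch. It is not the NHM's method or any participant's tool: the field name `collection_date`, the list of accepted date formats, and the `needs_review` flag are all hypothetical choices made for this example. It normalises whitespace, tries to convert collection dates to ISO 8601, and flags records it cannot parse so they can go to human review (for instance, a crowdsourcing queue).

```python
from datetime import datetime

# Hypothetical set of date layouts seen on legacy labels. Slash dates are
# assumed to be day/month/year, which may not hold for every collection.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d %B %Y", "%d %b %Y")


def normalise_date(raw):
    """Return an ISO 8601 date string, or None if no known format matches."""
    text = " ".join(raw.split())
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_record(record):
    """Collapse stray whitespace and add a normalised date plus a review flag."""
    cleaned = {key: " ".join(str(value).split()) for key, value in record.items()}
    iso_date = normalise_date(cleaned.get("collection_date", ""))
    cleaned["collection_date_iso"] = iso_date
    # Records that cannot be parsed automatically are flagged for a person.
    cleaned["needs_review"] = iso_date is None
    return cleaned


if __name__ == "__main__":
    print(clean_record({"collector": "  A.  Smith ", "collection_date": "12 March 1897"}))
    print(clean_record({"collector": "B. Jones", "collection_date": "Summer 1897"}))
```

The original value is kept alongside the normalised one, so nothing from the label is lost when the automated step guesses wrong.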

Who participated? Here's a partial list of participants (from the NHM Blog): Cisco; SciFabric; Singular Intelligence; Sparrho; Warwick Analytics; Wikidata; Xerox; Zegami; Zeutschel; Springer Nature; The British Library; University of Cambridge; University of Glasgow; University of Sussex; National Biodiversity Network; Pantar; Pensoft; Picturae; Restore; Axiell; Royal Botanic Garden Edinburgh; Royal Botanic Gardens Kew; iDigBio; and others.

What happens next?
At the end of the Pit Stop, the participants were invited to write proposals addressing one or more of the Pit Stop challenges. CISCO and the NHM plan to choose at least 2 project proposals to fund. The expected time-to-product is one year. We're still waiting to hear which proposals get selected. We should know by SPNHC 2016.

What's next at iDigBio?
At iDigBio, we’ve just started discussing the idea of doing something like this on our side of the ocean. Got ideas you'd like to share? Do you know of private enterprises that want to participate, to collaborate? How might we get more automation / robotics / imaging solutions people at the table? Have you seen this initiative? Biological Collections as a Resource for Technical Innovation and http://www.bist.centers.vt.edu/.

What / Who is the DCC?
The UK Digital Catapult Centre, funded by Innovate UK (a UK government program https://www.gov.uk/government/organisations/innovate-uk), fosters relationships among incubators, accelerators, and funders, and between the academic, public, and private sectors.

Why CISCO?
CISCO actively looks for innovative ideas and new data. The NHM's collections and visitor information offer CISCO new data and new opportunities for helping the academic sector meet its research and development, education, and outreach goals. These kinds of collaborative events increase visibility for CISCO and for everyone else involved. They allow key players to share stories that, in turn, raise awareness of current issues and challenges, spark ideas for change and experimentation, and advance new ideas and processes.

Thanks for reading! If you've got questions, or would like to discuss this further, please contact us. You can find me as @idbdeb on Twitter, or by email at dpaul AT FSU dot EDU.

If you got this far, you must be hooked, ...


So, You Want to READ Even More About It? (links to more posts by hosts, sponsors, and participants)