OCR / NLP Workflows

From iDigBio

Revision as of 16:09, 22 April 2013

OCR / NLP Workflow & Protocol Documents

Barber, A.C., Lafferty, D., & Landrum, L.R. in press. The SALIX Method: A semi-automated workflow for herbarium specimen digitization. Taxon.

Abstract. Supported by a United States American Recovery and Reinvestment Act grant, we have developed a workflow, “the SALIX Method,” to image, database, and provide web access to ca. 60,000 Latin American plant specimens housed at the Arizona State University Herbarium. The SALIX Method incorporates optical character recognition using ABBYY FineReader and uses other proprietary software for word processing (Microsoft Word) and image management (Adobe Lightroom). We developed the other applications ourselves: SALIX for text parsing, and BarcodeRenamer (BCR) for renaming image files to match their barcodes. We use our Symbiota data portal (SEINet) to provide web access to collections data and images. Data entry was found to be about as fast to considerably faster using the SALIX Method than by keystroke entry directly into SEINet. Speed is dependent on label quality and length as well as user proficiency.
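
The abstract mentions renaming image files to match their barcodes. As an illustration of that one step only, here is a minimal Python sketch. It is not the authors' BarcodeRenamer, whose internals the abstract does not describe. It assumes the barcode for each image has already been decoded by some other tool. The function names `plan_renames` and `apply_renames`, the input mapping, and the sample barcode value are all hypothetical.

```python
from pathlib import Path


def plan_renames(decoded):
    """Map each original file name to a barcode-based name, keeping the extension.

    `decoded` maps an image file name to the barcode string read from that
    image (hypothetical input; decoding the barcode is outside this sketch).
    Files with no barcode (None or empty) are skipped. Repeated barcodes get
    a numeric suffix so that no two files end up with the same name.
    """
    plan = {}
    seen = {}  # barcode -> number of files already assigned to it
    for name in sorted(decoded):
        barcode = decoded[name]
        if not barcode:
            continue  # leave undecoded images for manual review
        count = seen.get(barcode, 0)
        seen[barcode] = count + 1
        suffix = "" if count == 0 else f"_{count}"
        plan[name] = f"{barcode}{suffix}{Path(name).suffix.lower()}"
    return plan


def apply_renames(directory, plan):
    """Rename files inside `directory` according to `plan`, refusing to overwrite."""
    directory = Path(directory)
    for old, new in plan.items():
        target = directory / new
        if target.exists():
            raise FileExistsError(f"{target} already exists")
        (directory / old).rename(target)
```

Keeping the planning step separate from the renaming step lets an operator review the proposed names before any file on disk is touched, for example by printing `plan_renames(decoded)` first and calling `apply_renames` only once the mapping looks right.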