Presentations & Reports



::;Hackathon Metrics - Alex Thompson
::;Parsing Dataset 1 - Daryl Lafferty
::;[http://manuscripttranscription.blogspot.com/2013/02/improving-ocr-inputs-from-ocr-outputs.html Improving OCR Inputs from OCR Outputs] - Ben Brumfield: Efforts to improve the quality of OCR by pre-processing images based on the output of 'naive' OCR execution.  Topics included handwriting detection within Dataset 1 ([http://manuscripttranscription.blogspot.com/2013/02/detecting-handwriting-in-ocr-text.html final report]) and label extraction from Dataset 3 ([http://manuscripttranscription.blogspot.com/2013/02/results-of-ocrocrop-approach-to.html final report]).
::;Image Segmentation - Phuc Nguyen
::;[https://www.idigbio.org/sites/default/files/workshop-presentations/aocr-hackathon/HackathonPresentation.doc Parsing Dataset 1] - Robert Anglin
:::The Lichens, Bryophytes and Climate Change (LBCC) project aims to digitize the label information from millions of North American lichen and bryophyte herbarium specimens. These labels are occasionally handwritten, but most are at least partially typed or printed, in a myriad of fonts. Thus far we have been constrained to open-source Optical Character Recognition (OCR) software, and by most accounts the best of these is Tesseract: it does not recognize handwriting, but it is considered the best at recognizing typewritten text in images. Our workflow involves attaching a barcode to each label, imaging it, using the barcode as the image filename, and submitting the images to an FTP server to be entered into Symbiota, a MySQL database designed by Ed Gilbert, with the barcode as the catalog number. Once the images are in Symbiota, they are batch processed with Tesseract and the OCR results are stored in the database as well. The next step is to parse the Tesseract output to retrieve the label information, with the goal of populating the relevant database fields. I have been developing code for this purpose in PHP 5.3 using its PCRE regular-expression library.
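:::The parsing step described above can be illustrated with a minimal sketch. This is not the project's PHP code: it is a Python stand-in, and the field names (<code>event_date</code>, <code>collector</code>), the two regular expressions, and the sample barcode are all assumptions made for illustration. Real Tesseract output contains recognition errors and far more label layouts than two patterns can cover.

```python
import re
from pathlib import Path

# Hypothetical patterns for illustration only; real labels need many more rules.
# Matches dates such as "12 July 1957" or "3 Aug. 1921".
DATE_RE = re.compile(
    r"\b(\d{1,2})\s+"
    r"((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)\.?,?\s+"
    r"(\d{4})\b",
    re.IGNORECASE,
)
# Matches a line such as "Coll: J. Smith", "Collector: J. Smith" or "Leg.: J. Smith".
COLLECTOR_RE = re.compile(
    r"^\s*(?:Coll(?:ector)?\.?|Leg\.?)\s*:\s*(.+)$",
    re.IGNORECASE | re.MULTILINE,
)


def catalog_number_from_filename(image_path):
    """Return the barcode, assuming the image file is named after it."""
    return Path(image_path).stem


def parse_label(ocr_text):
    """Extract the fields we can recognize from raw OCR text.

    Returns a dict containing only the fields that were found, so callers
    can leave unmatched database fields empty rather than guessing.
    """
    fields = {}
    date = DATE_RE.search(ocr_text)
    if date:
        fields["event_date"] = " ".join(date.groups())
    collector = COLLECTOR_RE.search(ocr_text)
    if collector:
        fields["collector"] = collector.group(1).strip()
    return fields


if __name__ == "__main__":
    sample = "Cladonia rangiferina\nAlaska, near Fairbanks\nColl: J. Smith\n12 July 1957"
    print(catalog_number_from_filename("/images/LBCC0012345.jpg"))
    print(parse_label(sample))
```

:::Returning only the matched fields, rather than a fixed record with guessed defaults, keeps unparsed labels easy to flag for manual review.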
::;LabelX - Bryan Heidorn & Qianjin Zhang
::;Parsing Dataset 2 - Dmitry Mozzherin