Image Selection and Processing Protocols

=== Decisions about Image Sets and How to Parse and Process the Data ===


==== OCR Images ====
*Batch of 10,000 images per set.
*200 selected from the 10,000 that will serve as the Gold and Silver standards
*Herbarium labels (full sheets)
**NYBG will supply 5000 and select 100 gold
***The selection criteria for the 5000 images were:
***The selection criteria for the 100 Gold were:
**BRIT will supply 5000 and select 100 gold
***The selection criteria for the 5000 images were:
***The selection criteria for the 100 Gold were:
*Packet labels
**CNALH (Ed Gilbert) will supply 10,000 lichen images and select 200 gold
***The selection criteria for the 10,000 images were:
***The selection criteria for the 200 Gold were:
*CalBug provided 523 Entomology labels
**There were 523 initial images
***The selection criteria for the 523 images were:
***The selection criteria for the 199 Gold were:
*Primary typed labels should be the target. Some handwriting mixed in with the text is OK, and even preferable for a small portion of images. These images will produce “noise” that is more realistic for our situation.
*Images should be JPGs. If you have TIFFs or another format, you can make those images available within another folder.
*Compression: none to minor (as lossless as possible)
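
As a minimal sketch of the format guideline above, the following Python function separates JPG file names from everything else so that the other formats can be placed in a second folder. The function name <code>partition_by_format</code> and the accepted suffixes (.jpg, .jpeg) are assumptions made for illustration; they are not part of the protocol.

```python
from pathlib import PurePath

# Assumption: both .jpg and .jpeg count as JPG, compared case-insensitively.
JPG_SUFFIXES = {".jpg", ".jpeg"}


def partition_by_format(filenames):
    """Split file names into (jpgs, others) based on the file extension."""
    jpgs, others = [], []
    for name in filenames:
        if PurePath(name).suffix.lower() in JPG_SUFFIXES:
            jpgs.append(name)
        else:
            # TIFFs and any other format go to the separate folder.
            others.append(name)
    return jpgs, others
```

The first list holds the images for the main set; the second holds the files that would be offered in the additional folder.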


==== Processing for gold and silver images ====
*200 Hand Typed Transcriptions (Gold)
**Transcription of the label text as close as possible to what is on the label
*200 OCR Transcriptions (Silver)
**Generated from raw Tesseract (?) OCR output of the same images used for the gold
**Text should not be corrected
==== Specific Standard Parsing Decisions ====
:::;verbatimCoordinates: do not include the words '''latitude''' and '''longitude''', just the values with a space between them (do not add a comma).
:::;verbatimEventDate: enter exactly as it appears on the label.
:::;eventDate and dateIdentified: use yyyy-mm-dd format. Use yyyy if only the year is given, and yyyy-mm if only the month and year are given (e.g., Feb. 1990 becomes 1990-02), ...
:::;host and habitat: for our purposes, we were collecting the habitat field. If host data is present on the label, it was parsed into the habitat field. Please put the data into this field and do not add a comma between the host and habitat information, just a space. Also keep the host and habitat information in the same order as they appear on the label.
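
The parsing decisions above can be illustrated with a short Python sketch. It is only one possible reading of the rules, not the tooling used for the standards. The function names are hypothetical, and the comma handling in <code>format_verbatim_coordinates</code> assumes that a comma on the label only ever separates the two values.

```python
import re


def format_event_date(year, month=None, day=None):
    """Return yyyy, yyyy-mm, or yyyy-mm-dd, depending on what the label gives."""
    if month is None:
        return f"{year:04d}"
    if day is None:
        return f"{year:04d}-{month:02d}"
    return f"{year:04d}-{month:02d}-{day:02d}"


def format_verbatim_coordinates(text):
    """Drop the words latitude/longitude and separate the values with one space."""
    # Remove the words themselves, plus a directly attached colon or period.
    text = re.sub(r"\b(latitude|longitude)\b[:.]?", " ", text, flags=re.IGNORECASE)
    # Assumption: a comma only separates the two values, so it becomes a space.
    text = text.replace(",", " ")
    return " ".join(text.split())


def join_host_habitat(parts):
    """Join host and habitat fragments in label order, space separated, no comma."""
    return " ".join(p.strip() for p in parts if p.strip())
```

For example, <code>format_event_date(1990, 2)</code> gives <code>1990-02</code>, and <code>join_host_habitat</code> expects the fragments in the order in which they appear on the label.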
Back to the [[2013 AOCR Hackathon Wiki]]