Image Selection and Processing Protocols: Difference between revisions

(5 intermediate revisions by the same user not shown)
Line 1: Line 1:
== Decisions about Image Sets and How to Parse and Process the Data ==
=== Decisions about Image Sets and How to Parse and Process the Data ===


=== OCR Images ===
==== OCR Images ====
*Batch of 10,000 images per set.
*200 selected from the 10,000 that will serve as the Gold and Silver standards
Line 25: Line 25:
*Compression: none to minor (as lossless as possible)


=== Processing for gold and silver images ===
==== Processing for gold and silver images ====
*200 Hand Typed Transcriptions (Gold)
**Transcription of the label text as close to what is on the label
Line 45: Line 45:
**Generated from raw Tesseract (?) OCR output of the same images used for the gold
**Text should not be corrected
==== Specific Standard Parsing Decisions ====
:::;verbatimCoordinates: enter only the values, separated by a single space. Do not include the words '''latitude''' and '''longitude''', and do not add a comma.
:::;verbatimEventDate: enter exactly as it appears on the label
:::;eventDate and dateIdentified: use yyyy-mm-dd format. Use yyyy if the label gives only the year, use yyyy-mm if it gives only the month and year (e.g. Feb. 1990 becomes 1990-02), ...
:::;host and habitat: for our purposes, we were collecting the habitat field. If host data is present on the label, it was parsed into the habitat field. Please put host data into this field as well, separated from the habitat information by a single space (do not add a comma), and keep the host and habitat information in the same order as they appear on the label.
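
The parsing decisions above can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the hackathon tooling: the function names (<code>format_event_date</code>, <code>format_verbatim_coordinates</code>, <code>format_habitat</code>) and the sample values are hypothetical, and the code assumes the date parts and label fragments have already been read off the label by hand or by an upstream parser.

```python
from typing import Optional, Sequence


def format_event_date(year: int,
                      month: Optional[int] = None,
                      day: Optional[int] = None) -> str:
    """Format eventDate / dateIdentified as yyyy, yyyy-mm, or yyyy-mm-dd.

    Only the parts present on the label are emitted. A day without a
    month is ignored (assumption: such a date cannot be expressed).
    """
    if month is None:
        return f"{year:04d}"
    if day is None:
        return f"{year:04d}-{month:02d}"
    return f"{year:04d}-{month:02d}-{day:02d}"


def format_verbatim_coordinates(latitude_value: str,
                                longitude_value: str) -> str:
    """Join the two coordinate values with one space: no comma and no
    'latitude' / 'longitude' words. The values themselves are left as
    written on the label."""
    return f"{latitude_value.strip()} {longitude_value.strip()}"


def format_habitat(parts_in_label_order: Sequence[str]) -> str:
    """Join host and habitat fragments with single spaces, keeping the
    order in which they appear on the label (no comma is inserted)."""
    cleaned = [part.strip() for part in parts_in_label_order]
    return " ".join(part for part in cleaned if part)


if __name__ == "__main__":
    # Hypothetical sample values, for illustration only.
    print(format_event_date(1990))           # year only
    print(format_event_date(1990, 2))        # Feb. 1990
    print(format_event_date(1990, 2, 7))     # full date
    print(format_verbatim_coordinates("42.5 N", "71.3 W"))
    print(format_habitat(["on Quercus alba", "dry oak woodland"]))
```

Passing the fragments to <code>format_habitat</code> as an ordered sequence, rather than as separate host and habitat arguments, keeps the label order explicit, since the host may appear either before or after the habitat on a given label.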


Back to the [[2013 AOCR Hackathon Wiki]]