OCR Tips: Difference between revisions

69 bytes added, 2 October 2012
<br>What to look out for:


Resolution: an x-height below 8-12 pixels will produce very poor OCR return<br>
Using a black background for package labels (e.g. lichens, bryophytes) will create a black border that can significantly reduce OCR return<br>
Form labels can interfere with OCR output<br>
Faded labels or images with poor lighting can be problematic<br>
Old fonts can be problematic. However, it is possible to train Tesseract for new fonts.
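As a rough way to check the resolution point before running OCR, the x-height in pixels can be estimated from the printed font size and the scan resolution. This is only a sketch: the function names are hypothetical, and the default x-height ratio of 0.5 is an assumption (the real ratio varies by typeface). The only fixed fact used is that one point is 1/72 of an inch.

```python
def estimated_x_height_px(font_size_pt, dpi, x_height_ratio=0.5):
    """Estimate the x-height in pixels of text scanned at a given resolution.

    font_size_pt   -- printed font size in points (1 pt = 1/72 inch)
    dpi            -- scan resolution in pixels per inch
    x_height_ratio -- ASSUMED x-height as a fraction of the font size
    """
    return font_size_pt * x_height_ratio * dpi / 72.0


def min_dpi_for_x_height(font_size_pt, target_px, x_height_ratio=0.5):
    """Estimate the scan resolution needed to reach a target x-height in pixels."""
    return target_px * 72.0 / (font_size_pt * x_height_ratio)


if __name__ == "__main__":
    # 8 pt label text, aiming for the upper end of the 8-12 pixel range
    print(min_dpi_for_x_height(8, 12))
```

Under these assumptions, a label that falls below the 8-12 pixel range would need to be imaged again at a higher resolution rather than fixed after the fact.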


<br>Fixing errors:


Tesseract makes characteristic errors. Some of these, such as "\/\/" or "\X/" substituted for "W", can be globally replaced, as it is highly unlikely that they would occur on their own on a label. Others, such as "O" substituted for "0", "1" or "!" substituted for "l", or "Z" substituted for "2" (or vice versa), can be replaced in a context-dependent manner in dates, latitudes and longitudes, etc. For instance, "0ct. !Z, ZOlZ" can be located with a regular expression and changed to "Oct. 12, 2012" so that it can be entered into a database.
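The two kinds of fixes described above might be sketched in Python as follows. This is an illustrative sketch, not an established workflow: the names `GLOBAL_FIXES`, `DATE_RE`, `fix_date` and `clean_ocr` are hypothetical, the date pattern only covers the "Mon. D, YYYY" layout from the example, and the character mappings are limited to the substitutions listed above.

```python
import re

# Sequences that are very unlikely to occur on their own on a label,
# so they can be replaced everywhere.
GLOBAL_FIXES = {r"\/\/": "W", r"\X/": "W"}

# Context-dependent mappings: applied only inside a date-like match.
TO_DIGIT = str.maketrans({"O": "0", "l": "1", "!": "1", "Z": "2"})
TO_LETTER = str.maketrans({"0": "O", "1": "l", "!": "l"})

MONTHS = {"Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"}

# Month abbreviation, optional period, day, comma, four-character year,
# each allowing the characters Tesseract tends to confuse.
DATE_RE = re.compile(
    r"(?<![A-Za-z0-9])([A-Za-z0!1]{3})(\.?) "
    r"([0-9OlZ!]{1,2}), ([0-9OlZ!]{4})(?![A-Za-z0-9])"
)


def fix_date(match):
    """Repair one date-like match, or leave it alone if the month is unknown."""
    month = match.group(1).translate(TO_LETTER).capitalize()
    if month not in MONTHS:
        return match.group(0)
    day = match.group(3).translate(TO_DIGIT)
    year = match.group(4).translate(TO_DIGIT)
    return f"{month}{match.group(2)} {day}, {year}"


def clean_ocr(text):
    """Apply the global replacements, then the date-specific ones."""
    for wrong, right in GLOBAL_FIXES.items():
        text = text.replace(wrong, right)
    return DATE_RE.sub(fix_date, text)


if __name__ == "__main__":
    print(clean_ocr("0ct. !Z, ZOlZ"))
```

Checking the month against a known list keeps the digit substitutions from firing on ordinary text that merely resembles a date. Latitudes and longitudes would need their own patterns along the same lines.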


<br>Misc notes:


Tesseract will often recognize vertical text<br>
Image input can be TIFF, JPEG, or GIF.