OCR Tips

Resolution: an x-height below 8-12 pixels will produce very poor OCR return<br> Using a black background for package labels (e.g. lichens, bryophytes) will create a black border that can significantly reduce OCR return<br> Form labels can interfere with OCR output<br> Faded labels or images with poor lighting can be problematic<br> Old fonts can be problematic. However, it is possible to train Tesseract for new fonts
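
The resolution tip above can be checked before scanning with a little arithmetic. The following is a minimal sketch, not a calibrated rule: it assumes the label's type size in points is known and that the x-height is roughly half the point size (the <code>x_height_ratio</code> default of 0.5 is an assumption; real typefaces vary). The function names are hypothetical.

```python
POINTS_PER_INCH = 72.0


def x_height_px(point_size, dpi, x_height_ratio=0.5):
    """Estimate the x-height in pixels of text scanned at `dpi`.

    x_height_ratio is an assumed fraction of the point size; measure the
    actual typeface if precision matters.
    """
    return point_size * x_height_ratio * dpi / POINTS_PER_INCH


def min_dpi(point_size, target_px=12, x_height_ratio=0.5):
    """Smallest scan resolution giving an x-height of at least target_px."""
    return target_px * POINTS_PER_INCH / (point_size * x_height_ratio)


if __name__ == "__main__":
    # 8 pt label text: compare two scan resolutions against the 8-12 px range.
    for dpi in (150, 300):
        print(dpi, "dpi:", round(x_height_px(8, dpi), 1), "px")
    print("dpi needed for a 12 px x-height:", round(min_dpi(8)))
```

Under these assumptions, 8 pt text scanned at 150 dpi sits at the bottom of the 8-12 pixel range, so a higher scan resolution would be the safer choice.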


==== Fixing errors: ====


Tesseract makes characteristic errors. Some of these, such as "\/\/" or "\X/" substituted for "W", can be globally replaced, since these sequences are highly unlikely to occur on their own on a label. Others, such as "O" substituted for "0", "1" or "!" substituted for "l", or "Z" substituted for "2" (or vice versa), can be replaced in a context-dependent manner in dates, latitudes and longitudes, etc. For instance, a string containing multiple errors such as "0ct.&nbsp;!Z, ZOlZ" can be located programmatically with a regular expression and changed to "Oct. 12, 2012" or even "12-October-2012" so that it can be entered into a database.
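
Both kinds of correction described above can be sketched in Python using only the standard library. This is an illustration under stated assumptions rather than a complete cleaner: the names <code>fix_global</code> and <code>fix_dates</code> are hypothetical, the date pattern only covers the "Mon. DD, YYYY" layout with three-letter month abbreviations, and the confusion table lists only the substitutions mentioned here plus a few close relatives.

```python
import re

# Sequences that are very unlikely to appear on a label on their own,
# so they can be replaced everywhere.
GLOBAL_FIXES = {"\\/\\/": "W", "\\X/": "W"}

# Letter-for-digit confusions, undone only inside numeric date fields.
TO_DIGIT = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "!": "1", "Z": "2"}
)

MONTHS = {
    "jan": "January", "feb": "February", "mar": "March", "apr": "April",
    "may": "May", "jun": "June", "jul": "July", "aug": "August",
    "sep": "September", "oct": "October", "nov": "November",
    "dec": "December",
}

# Month abbreviation (leading "O" may be read as "0"), then day, then year.
DATE_RE = re.compile(
    r"\b([A-Za-z0][a-z]{2})\.?\s+([0-9OolI!Z]{1,2}),\s*([0-9OolI!Z]{4})\b"
)


def fix_global(text):
    """Apply the context-free replacements."""
    for bad, good in GLOBAL_FIXES.items():
        text = text.replace(bad, good)
    return text


def fix_dates(text):
    """Rewrite recognizable dates as DD-Month-YYYY; leave other text alone."""
    def repl(match):
        month = MONTHS.get(match.group(1).replace("0", "o").lower())
        if month is None:
            return match.group(0)  # not a month name, so do not touch it
        day = match.group(2).translate(TO_DIGIT)
        year = match.group(3).translate(TO_DIGIT)
        return f"{int(day)}-{month}-{year}"
    return DATE_RE.sub(repl, text)


if __name__ == "__main__":
    print(fix_global(r"\/\/ashington"))
    print(fix_dates("0ct. !Z, ZOlZ"))
```

Restricting the digit substitutions to text matched by the date pattern is what makes the fix context-dependent: an "O" or "Z" elsewhere on the label is left untouched. In Python 3 string patterns, <code>\s</code> also matches a non-breaking space, so the example string would still be found if the OCR output contained one.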


==== Misc notes: ====


Tesseract will often recognize vertical text<br> Image input can be TIFF, JPEG, or GIF