OCR Tips: Difference between revisions

502 bytes added, 2 October 2012
No edit summary
Line 46: Line 46:
<br>What to look out for:


Resolution: an x-height below 8-12 pixels will produce very poor OCR return<br> Using a black background for package labels (e.g. lichens, bryophytes) will create a black border that can significantly reduce OCR return<br> Form labels can interfere with OCR output<br> Faded labels or images with poor lighting can be problematic<br> Old fonts can be problematic. However, it is possible to train Tesseract for new fonts.
 
<br>Fixing errors:
 
Tesseract makes characteristic errors.  Some of these, such as "\/\/" or "\X/" substituted for "W", can be globally replaced.  Others, such as "O" substituted for "0", "1" or "!" substituted for "l", or "Z" substituted for "2" (or vice versa), can be replaced in a context-dependent manner in dates, latitudes and longitudes, etc.  For instance, "0ct. !Z, ZOlZ" can be located with a regular expression and changed to "Oct. 12, 2012" so that it can be entered into a database.
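
The two kinds of correction described above could be sketched in Python roughly as follows. This is only an illustration, not part of Tesseract: the function name `fix_ocr`, the look-alike tables, and the date pattern are all assumptions made for this example, and the pattern only covers dates written as a three-letter month abbreviation, a day, a comma, and a four-digit year.

```python
import re

# Character runs that stand in for "W" and are safe to replace everywhere.
GLOBAL_FIXES = {r"\/\/": "W", r"\X/": "W"}

# Assumed look-alike tables (hypothetical, extend as needed).
TO_DIGIT = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "!": "1", "Z": "2"})
TO_LETTER = str.maketrans({"0": "O", "1": "l"})

MONTHS = {"Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"}

# Something shaped like "Mon. DD, YYYY" in which digits and look-alike
# letters may have been swapped by the OCR engine.
DATE_RE = re.compile(
    r"(?<![0-9A-Za-z])"        # not in the middle of a word
    r"([A-Za-z01]{3})"         # month abbreviation, possibly with 0 or 1 in it
    r"(\.?\s*)"                # optional period and spacing (kept as found)
    r"([0-9OolI!Z]{1,2})"      # day
    r"(,\s*)"                  # comma and spacing (kept as found)
    r"([0-9OolI!Z]{4})"        # year
    r"(?![0-9A-Za-z])"         # not followed by more word characters
)


def _fix_date(match):
    month, sep1, day, sep2, year = match.groups()
    month = month.translate(TO_LETTER).capitalize()
    if month not in MONTHS:
        # Not a recognizable month, so leave the text untouched.
        return match.group(0)
    return month + sep1 + day.translate(TO_DIGIT) + sep2 + year.translate(TO_DIGIT)


def fix_ocr(text):
    """Apply the global replacements, then the context-dependent date repair."""
    for wrong, right in GLOBAL_FIXES.items():
        text = text.replace(wrong, right)
    return DATE_RE.sub(_fix_date, text)


print(fix_ocr(r"\/\/ashington, 0ct. !Z, ZOlZ"))
```

Restricting the digit substitutions to text that already looks like a date keeps ordinary words containing "O", "l" or "Z" from being altered. Latitudes and longitudes would need their own patterns in the same style.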


<br>Misc notes:


Will often recognize vertical text<br> Image input can be tif, jpeg, or gif.