Wednesday 7 August 2013

OCR

I was wondering whether there were any easy-to-use OCR tools for Linux. It turns out there is Tesseract, originally developed by HP and now maintained by Google (I guess they are using it for their digitisation projects).

To use it, simply pass it an image file and the name of the output you want the text written to (tesseract adds the .txt extension to that name itself), e.g.
tesseract <image file> <output>
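
As a rough sketch, the call above could be wrapped in a small shell function that derives the output name from the image name. The function name ocr_image and the file name scan.tif are made up for illustration; the only thing taken from tesseract itself is the two-argument form shown above.

```shell
# Sketch only: ocr_image is a hypothetical helper, not part of tesseract.
# It strips the image extension and uses the result as the output base,
# so scan.tif ends up as scan.txt (tesseract appends ".txt" itself).
ocr_image() {
    image="$1"
    base="${image%.*}"                 # scan.tif -> scan
    tesseract "$image" "$base" || return 1
    printf '%s\n' "$base.txt"          # report where the text went
}

# Usage (assumes scan.tif exists in the current directory):
#   ocr_image scan.tif
```

The function prints the name of the text file it produced, which makes it easy to chain into other commands.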

I first tried a PNG image, but that did not produce anything. TIFF seems to work, so I converted the PNG image using GIMP.

This could be useful for scripting something together with scanning software (like scanimage) to automatically produce a text file from a scanned document.
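
Here is a minimal, untested-on-real-hardware sketch of what such a script might look like. It assumes scanimage can find a default scanner and that TIFF output is what tesseract wants (as found above); the function name scan_to_text and the default base name "scan" are my own placeholders.

```shell
# Sketch only: scan_to_text is a hypothetical helper.
# scanimage writes the image to stdout, so it is redirected into a file,
# then tesseract reads that file and writes <base>.txt.
scan_to_text() {
    base="${1:-scan}"                          # output base name, default "scan"
    scanimage --format=tiff > "$base.tif" || return 1
    tesseract "$base.tif" "$base" || return 1
    printf '%s\n' "$base.txt"                  # name of the resulting text file
}

# Usage (assumes a scanner is connected and configured for SANE):
#   scan_to_text letter     # produces letter.tif and letter.txt
```

Scanner-specific settings such as resolution vary by device, so they are left out here; they would go on the scanimage line.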

I will post more on this once I have played around with it some more.