Clean Images

Clean Images -- Higher Quality OCR

With the arrival of e-books, almost all publishers are racing to convert their hardcopy publications. The convenience of reading on a small hand-held device, called an eBook reader, is unmatched. There are several eBook readers on the market, such as Amazon's Kindle, Sony's Reader and Barnes & Noble's Nook.

Most eBook readers support the .epub format, which means hardcopy books need to be converted to this format. The first step in the conversion is to scan the book into an image format such as JPG or TIF.

Each image is then converted to a formatted text file with embedded graphics. The cheapest way to do this conversion is with OCR software (ABBYY's FineReader is a good example). The accuracy of the OCR process is not very high: every little speck interspersed in the text block can be read by the software as a punctuation mark, and a smudge may be read as a word that has nothing to do with the text.

If the scanned images are cleaned before running the OCR, the resulting text is of higher quality. Images can be cleaned up using any standard image editing software. My favourite is IrfanView, and it's free! Cleanup involves processes such as deskewing, despeckling and adjusting contrast and brightness, which can be run as a batch on all the images. In addition to the batch processing, every page image needs to be viewed for inconsistencies and corrected manually.
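For readers who would rather script the batch step, here is a minimal sketch of two of the operations named above, despeckling and contrast adjustment, in plain Python. It is an illustration only, not how IrfanView or any OCR package does it. It assumes each page has already been loaded as a grid of grayscale values (0 is black, 255 is white), and the function names despeckle, stretch_contrast and clean_batch are made up for this example. Deskewing is left out because it needs a proper imaging library.

```python
from statistics import median


def despeckle(page):
    """Apply a 3x3 median filter to a grayscale page (list of rows of 0-255 ints).

    An isolated dark speck is outvoted by its lighter neighbours and disappears.
    """
    height, width = len(page), len(page[0])
    cleaned = [row[:] for row in page]
    for y in range(height):
        for x in range(width):
            # Collect the pixel and its neighbours, clipped at the page edges.
            window = [
                page[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            cleaned[y][x] = int(median(window))
    return cleaned


def stretch_contrast(page):
    """Linearly rescale pixel values so the darkest becomes 0 and the lightest 255."""
    lo = min(min(row) for row in page)
    hi = max(max(row) for row in page)
    if hi == lo:
        # A uniform page has no contrast to stretch; return an unchanged copy.
        return [row[:] for row in page]
    return [[(value - lo) * 255 // (hi - lo) for value in row] for row in page]


def clean_batch(pages):
    """Run the same cleanup on every page: despeckle first, then stretch contrast."""
    return [stretch_contrast(despeckle(page)) for page in pages]


if __name__ == "__main__":
    # Hypothetical 5x5 white page with a single black speck in the middle.
    speck_page = [[255] * 5 for _ in range(5)]
    speck_page[2][2] = 0
    for row in clean_batch([speck_page])[0]:
        print(row)
```

A median filter can also thin out very fine strokes in small type, which is one more reason the manual page-by-page check is still needed after the batch run.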

This involves huge labor costs if done in-house, but we can do it at a fraction of that cost. CyberData India, a company with over 15 years of experience, has worked on millions of images, cleaning them and converting them to text. For more information, visit us at edatashop.com.