lucene-solr-user mailing list archives

From Rick Leir <rl...@leirtech.com>
Subject Re: OCR image contains cyrillic characters
Date Fri, 10 Feb 2017 10:28:20 GMT
My guess is that you are using Tika and Tesseract. Tesseract is 
complex; a good place to start learning is

https://wiki.apache.org/tika/TikaOCR   <--shows you how to work with TIFF

The traineddata for Cyrillic is here:

https://github.com/tesseract-ocr/tesseract/wiki/Data-Files

https://github.com/tesseract-ocr/tesseract/issues/147
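
Once the Russian traineddata is installed, Tesseract has to be told to use it, otherwise it falls back to its default language, which would explain Latin-looking output. Below is a minimal sketch that calls the Tesseract command line directly, outside Tika, to check that the language data works. The helper names and the file names are hypothetical, and "rus" assumes rus.traineddata is in your tessdata directory.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="rus"):
    """Return the argv list for a Tesseract run with an explicit language."""
    # -l selects the traineddata to use; "rus+eng" combines two languages.
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="rus"):
    """Run Tesseract and return the path of the text file it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    # Tesseract appends .txt to the output base for plain-text output.
    return output_base + ".txt"
```

For example, run_ocr("scan.tif", "scan_out") would write scan_out.txt. If that file contains proper Cyrillic, the remaining work is passing the same language setting through your Tika configuration.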

You likely need to enhance the images before running Tesseract.
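
One common kind of enhancement is converting to grayscale and then to pure black and white with a fixed threshold. The sketch below shows the idea on plain nested lists of pixel values so it has no dependencies. The function names and the threshold of 128 are my assumptions, and in practice you would use an imaging library and tune the threshold for your scans.

```python
def to_grayscale(rgb_rows):
    """Convert rows of (r, g, b) tuples to rows of 0-255 luminance values."""
    # Standard luma weights (ITU-R BT.601).
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_rows
    ]


def binarize(gray_rows, threshold=128):
    """Map each grayscale value to 0 (black) or 255 (white)."""
    return [[255 if v >= threshold else 0 for v in row] for row in gray_rows]
```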

cheers -- Rick

On 2017-02-10 05:03 AM, Игорь Абрашин wrote:
> Hello, community!
> Did you manage to recognize jpf, tiff or whatever with Cyrillic text inside?
> So far I've got only Latin letters (it looks like ugly transliterated text)
> in the result. For images containing only Latin letters it works fine.
> Does anyone have any suggestions, best practices or case studies relating to
> this situation?
>

