lucene-solr-user mailing list archives

From Игорь Абрашин <vjiaste...@gmail.com>
Subject Re: OCR image contains cyrillic characters
Date Sat, 11 Feb 2017 05:54:02 GMT
Hi, Rick.
I didn't mean that Tesseract needs to be trained, because it works well when
run separately. The problem is that the Tika bundled with Solr doesn't try to
use the Russian dictionary to recognize Cyrillic text, so the result comes out
using only the English alphabet.
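
To reproduce the "works well separately" check outside Solr, Tesseract can be
called directly with the Russian language selected. The Python sketch below is
only an illustration: the function names and the `scan.tiff` file name are
hypothetical, and it assumes the `tesseract` binary and the `rus` traineddata
are installed. It does not change how Tika inside Solr picks its language.

```python
import subprocess


def build_tesseract_command(image_path, languages=("rus", "eng")):
    """Build the argument list for a standalone Tesseract run.

    Tesseract accepts several languages joined with '+', e.g. 'rus+eng'.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    # Using 'stdout' as the output base makes Tesseract print the text
    # instead of writing it to a file.
    return ["tesseract", str(image_path), "stdout", "-l", "+".join(languages)]


def ocr_image(image_path, languages=("rus", "eng")):
    """Run Tesseract and return the recognized text.

    Assumes the tesseract binary and the matching traineddata are installed.
    """
    result = subprocess.run(
        build_tesseract_command(image_path, languages),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if Tesseract fails
    )
    return result.stdout.decode("utf-8")
```

Calling `ocr_image("scan.tiff")` on an image with Cyrillic text should return
Cyrillic output if the traineddata is in place. If it does, the remaining
problem is on the Tika/Solr side, where the OCR language still has to be set.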

On 10 Feb 2017 at 15:28, "Rick Leir" <rleir@leirtech.com> wrote:

> My guess is that you are using Tika and Tesseract. The latter is
> complex, and you can start learning at
>
> https://wiki.apache.org/tika/TikaOCR   <--shows you how to work with TIFF
>
> The traineddata for Cyrillic is here:
>
> https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
>
> https://github.com/tesseract-ocr/tesseract/issues/147
>
> You likely need to enhance the images before running Tesseract.
>
> cheers -- Rick
>
> On 2017-02-10 05:03 AM, Игорь Абрашин wrote:
>
>> Hello, community!
>> Did you manage to recognize JPG, TIFF or other images with Cyrillic text
>> inside?
>> So far I get only Latin letters (it looks like ugly transliterated text)
>> in the result. For images containing only Latin letters it works fine.
>> Does anyone have any suggestions, best practices or case studies for
>> this situation?
>>
>>
>
