lucene-solr-user mailing list archives

From Rick Leir <>
Subject Re: OCR image contains cyrillic characters
Date Sat, 11 Feb 2017 14:44:06 GMT
Yes, you are right. I was just trying to help, and did not have time to dig out the details.
So the question is: how do you tell Solr to pass the language arg to Tika and Tesseract? 
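Until that is answered, it may help to confirm outside Solr that Tesseract itself reads the images when given the Russian language. Below is a minimal sketch, assuming the `tesseract` command-line tool is installed with the `rus` traineddata; the helper names `build_tesseract_cmd` and `ocr_image` are made up for illustration and are not part of Tika or Solr. It only exercises Tesseract's `-l` option and does not show how Solr passes the language through Tika.

```python
import subprocess
from typing import List


def build_tesseract_cmd(image_path: str, lang: str = "rus") -> List[str]:
    """Build a Tesseract command line that writes recognized text to stdout."""
    # "-l" selects the language; several can be joined with "+", e.g. "rus+eng".
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path: str, lang: str = "rus") -> str:
    """Run Tesseract on one image and return the recognized text.

    Assumes the tesseract binary is on PATH and the traineddata for
    `lang` is installed; raises CalledProcessError otherwise.
    """
    result = subprocess.run(
        build_tesseract_cmd(image_path, lang),
        check=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return result.stdout.decode("utf-8")
```

Calling `ocr_image("scan.tiff", "rus+eng")` on a sample scan (the file name is a placeholder) should return Cyrillic text if the language data is in place. If it does, the remaining problem is only how the language setting reaches Tesseract from Solr.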

On February 11, 2017 12:54:02 AM EST, "Игорь Абрашин" <> wrote:
>Hi, Rick.
>I didn't mean that Tesseract needs training, because it works well.
>The problem is that the Tika bundled with Solr doesn't try to use the Russian
>dictionary to recognize Cyrillic text, so the result comes back using only the
>English alphabet.
>On Feb 10, 2017 at 15:28, "Rick Leir" <> wrote:
>> My guess is that you are using Tika and Tesseract. The latter is
>> complex, and you can start learning at
>>   <--shows you how to work with
>> The traineddata for Cyrillic is here:
>> You likely need to enhance the images before running Tesseract.
>> cheers -- Rick
>> On 2017-02-10 05:03 AM, Игорь Абрашин wrote:
>>> Hello, community!
>>> Has anyone managed to recognize a JPG, TIFF or other image with Cyrillic
>>> text inside?
>>> So far I get only Latin letters in the result (it looks like ugly
>>> transliterated text). For images that contain only Latin letters it works fine.
>>> Does anyone have any suggestions, best practices or case studies for
>>> this situation?

Sent from my Android device with K-9 Mail. Please excuse my brevity.