lucene-solr-user mailing list archives

From Игорь Абрашин <vjiaste...@gmail.com>
Subject Re: Problem with cyrillics letters through Tika OCR indexing
Date Sat, 11 Feb 2017 05:49:56 GMT
I have the same problem. So it is probably the first case: how can I
force the Tika parser to recognize Cyrillic characters as required? For
me, it tries to recognize Russian text as English transliteration, so
the Russian text shows up in the result written only in the Latin
alphabet.
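
For reference, a minimal sketch of what "passing the language" means at
the bottom of the stack. It only builds and runs the plain Tesseract
command line with an explicit `-l` value, as mentioned in the original
message below. The function names and file names are hypothetical, and
it assumes the `tesseract` executable and the Russian trained data are
installed. How Solr or Tika forwards this setting is a separate
question that this sketch does not answer.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, language="rus"):
    """Return the argv list for a Tesseract run with an explicit language.

    Hypothetical helper: Tesseract itself takes the language through -l,
    and several languages can be combined with '+', e.g. "rus+eng".
    """
    return ["tesseract", image_path, output_base, "-l", language]


def run_ocr(image_path, output_base, language="rus"):
    """Run Tesseract if it is on PATH and return the text file it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(
        build_tesseract_command(image_path, output_base, language),
        check=True,
    )
    # Tesseract appends .txt to the output base name for plain-text output.
    return output_base + ".txt"
```

Running the same image once with `language="eng"` and once with
`language="rus"` is a quick way to tell whether the problem is the OCR
engine itself or the language setting not reaching it.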

On 10 Feb 2017 at 17:55, "Alexandre Rafalovitch" <
arafalov@gmail.com> wrote:

> At what level exactly is this a problem? Are you looking for a way for
> Solr to pass the -l rus flag to Tika?
>
> Or are you saying that whatever OCR is used here is bad? In the second
> case, this is probably not a question for Solr or even Tika, but for
> whatever the underlying OCR library is.
>
> The stack is deep here; more precision is required.
>
> Good luck,
>     Alex
>
> On 10 Feb 2017 2:52 AM, "Абрашин, Игорь Олегович" <
> Igor.Abrashin@novatek.ru> wrote:
>
> Hello, everyone. I have encountered the error mentioned in the title.
>
> The original image is attached, and the recognized text is below:
> 3ApaBCTyI7ITe 9| )KVIBy xopomo
>
>
>
> Has anyone faced something similar?
> I should mention that tesseract recognizes it more correctly with the
> -l rus option.
>
> Thanks in advance!
>
>
>
>
>
> *Best regards,*
>
> *Игорь Абрашин*
>
> *ООО «НОВАТЭК НТЦ»*
>
> *Work phone: +7 (3452) 680-386*
>
> *Internal corporate phone: 22-586*
>
> [image: 121]
>
>
>
>
>
