tika-dev mailing list archives

From Thamme Gowda <thammego...@apache.org>
Subject Re: Tess4j API for TIKA OCR parser
Date Sun, 05 Mar 2017 19:07:03 GMT
Thejan,

Welcome to the world of mysteries. I cannot explain why you are seeing
this, since I am unable to reproduce it.

Try a few other images; maybe the image you have chosen is corrupt, and
maybe an exception is being thrown and silently swallowed in the code.
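
A rough way to rule out a corrupt file before involving Tika at all is to
check whether the JDK itself can decode it. The sketch below is only an
illustration: the class name ImageCheck and the method isReadableImage are
made up for this mail and are not part of Tika, and ImageIO is not the
decoder that Tesseract uses, so a file that passes here can still fail
in OCR. It prints decode failures instead of swallowing them.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of Tika: a JDK-only sanity check for image files.
public class ImageCheck {

    /** Returns true if the JDK's ImageIO can decode the file into an image. */
    static boolean isReadableImage(File file) {
        try {
            BufferedImage img = ImageIO.read(file);
            if (img == null) {
                // ImageIO.read returns null when no registered reader recognises the data.
                System.err.println(file + ": no ImageIO reader could decode this file");
                return false;
            }
            return true;
        } catch (IOException | RuntimeException e) {
            // Report the failure rather than hiding it.
            System.err.println(file + ": decoding failed: " + e);
            return false;
        }
    }

    public static void main(String[] args) {
        // Usage: java ImageCheck test.jpg other.png ...
        for (String path : args) {
            boolean ok = isReadableImage(new File(path));
            System.out.println(path + " -> " + (ok ? "decodes" : "does NOT decode"));
        }
    }
}
```

If several different images all decode here and you still get no metadata,
the image itself is probably not the culprit.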

I suggest you do this:
   Please use an IDE like IntelliJ/Eclipse and step through the call stack
inside TesseractOCRParser with a debugger. It is indeed a nice way to get to
know the internals of Tika :-)


Best,
TG


*--*
*Thamme Gowda*
TG | @thammegowda <https://twitter.com/thammegowda>
~Sent via somebody's Webmail server!

On Sat, Mar 4, 2017 at 9:04 AM, Thejan Wijesinghe <
thejan.k.wijesinghe@gmail.com> wrote:

>
> Hi Thamme,
>
> Yes. I am using Ubuntu :) and I had both ImageMagick and Tesseract
> installed on my system using apt-get. Since I wasn't sure whether this was
> a problem with the APT software packages, I built both ImageMagick and
> Tesseract from source.
>
> I also double checked the availability of Tesseract and ImageMagick by
> typing CLI commands that you suggested and the below commands as well,
>
> convert test.jpg -resize 64x64 resized_test.jpg
>
> tesseract test.jpg out
>
> and they worked.
>
> I can't find an exact reason why I am not getting metadata, but when I used
> the AutoDetectParser class instead of the TesseractOCRParser class, I could
> extract both content and metadata.
>
> p.s. I will put updating the wiki OCR page in my TODO list :)
>
