tika-dev mailing list archives

From Thejan Wijesinghe <thejan.k.wijesin...@gmail.com>
Subject Tess4j API for TIKA OCR parser
Date Sat, 04 Mar 2017 17:04:57 GMT
Hi Thamme,

Yes, I am using Ubuntu :) and I had both ImageMagick and Tesseract
installed on my system via apt-get. Since I wasn't sure whether the problem
was with the APT packages, I also built both ImageMagick and Tesseract from
source.

I also double-checked that Tesseract and ImageMagick are available by
running the CLI commands you suggested, as well as the commands below:

convert test.jpg -resize 64x64 resized_test.jpg

tesseract test.jpg out

and they worked.
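
For anyone who wants to script the same availability check, here is a minimal
sketch. The helper name check_tool is hypothetical (not part of Tika,
ImageMagick, or Tesseract); it only tests whether a command is on the PATH,
not whether the tool works correctly.

```shell
# Hypothetical helper: report whether a command is on the PATH.
# Returns 0 if found, 1 if missing.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
    return 1
  fi
}
```

Calling check_tool convert and check_tool tesseract after defining the function
covers the two tools discussed here.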

I can't find an exact reason why I am not getting metadata, but when I use
the AutoDetectParser class instead of the TesseractOCRParser class, I can
extract both content and metadata.

P.S. I will add updating the OCR wiki page to my TODO list :)
