tika-dev mailing list archives

From Thejan Wijesinghe <thejan.k.wijesin...@gmail.com>
Subject Re: Tess4j API for TIKA OCR parser
Date Tue, 07 Mar 2017 07:47:33 GMT
Thamme,
I have already used the Tess4j API to rewrite the TesseractOCRParser class.
Although it successfully extracts content from most file types, it fails
some of the unit tests in the TesseractOCRParserTest class. I can fix those.
However, I want to know whether I can rewrite the entire TesseractOCRParser
class from the ground up. If I do that, a lot may break inside Tika,
because as far as I can see, most of the classes use the TesseractOCRParser
class indirectly.
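
One way to limit that kind of breakage is to keep the class that other code
depends on and swap only the OCR backend behind it. The sketch below
illustrates the idea in plain Java. Every name in it (OcrEngine,
CommandLineEngine, LibraryEngine, OcrFacade) is hypothetical and is not part
of Tika or Tess4j, and the engines return placeholder strings rather than
running real OCR.

```java
import java.util.Objects;

/** Hypothetical abstraction over an OCR backend; not an actual Tika type. */
interface OcrEngine {
    String recognize(String imagePath);
}

/** Stand-in for the existing behaviour (launching the tesseract executable). */
final class CommandLineEngine implements OcrEngine {
    @Override
    public String recognize(String imagePath) {
        // A real implementation would run the tesseract process here.
        return "cli:" + imagePath;
    }
}

/** Stand-in for a library-backed implementation such as one built on Tess4j. */
final class LibraryEngine implements OcrEngine {
    @Override
    public String recognize(String imagePath) {
        // A real implementation would delegate to the OCR library here.
        return "lib:" + imagePath;
    }
}

/** Hypothetical facade: callers keep using this class whichever engine is plugged in. */
final class OcrFacade {
    private final OcrEngine engine;

    OcrFacade(OcrEngine engine) {
        this.engine = Objects.requireNonNull(engine, "engine");
    }

    String extractText(String imagePath) {
        return engine.recognize(imagePath);
    }
}
```

With this shape, code that calls OcrFacade.extractText does not change when
the engine does, so an existing test suite can be run against both engines
to compare their behaviour.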

On Mon, Mar 6, 2017 at 12:37 AM, Thamme Gowda <thammegowda@apache.org>
wrote:

> Thejan,
>
> Welcome to the world of mysteries. I am unable to explain why you are
> facing it since I am unable to reproduce it.
>
> Try out a few other images; maybe the image you have chosen is corrupt, and
> maybe an exception is thrown and silently swallowed in the code.
>
> I suggest you do this:
>    Please use an IDE like IntelliJ/Eclipse and use a debugger to understand
> the call stack inside TesseractOCRParser. It is indeed a nice way to get to
> the internals of Tika :-)
>
>
> Best,
> TG
>
>
> *--*
> *Thamme Gowda*
> TG | @thammegowda <https://twitter.com/thammegowda>
> ~Sent via somebody's Webmail server!
>
> On Sat, Mar 4, 2017 at 9:04 AM, Thejan Wijesinghe <
> thejan.k.wijesinghe@gmail.com> wrote:
>
> >
> > Hi Thamme,
> >
> > Yes. I am using Ubuntu :) and I had both ImageMagick and Tesseract
> > installed on my system using apt-get. Since I wasn't sure whether this
> > is a problem with the APT packages, I also built both ImageMagick and
> > Tesseract from source.
> >
> > I also double-checked that Tesseract and ImageMagick are available by
> > running the CLI commands you suggested, as well as the commands below:
> >
> > convert test.jpg -resize 64x64 resized_test.jpg
> >
> > tesseract test.jpg out
> >
> > and they worked.
> >
> > I can't find an exact reason why I am not getting metadata, but when I
> > use the AutoDetectParser class instead of the TesseractOCRParser class, I
> > can extract both content and metadata.
> >
> > p.s. I will put updating the wiki OCR page in my TODO list :)
> >
>



-- 

Thejan Wijesinghe
Department of Computer Science and Engineering
University of Moratuwa
