tika-dev mailing list archives

From Thejan Wijesinghe <thejan.k.wijesin...@gmail.com>
Subject Re: Tess4j API for TIKA OCR parser
Date Tue, 07 Mar 2017 11:05:13 GMT
Hi Thamme,

I made minimal changes to the TesseractOCRParser class; basically, I only
changed the private doOCR() method. However, the existing unit tests fail
even though the content and metadata are extracted. Could you give me some
guidance on resolving the errors I get when running the test cases? I also
added some dependencies to the pom.xml in tika-parsers. Please check the
links below.

changed pom.xml:
https://github.com/ThejanW/tika/blob/master/tika-parsers/pom.xml

changed TesseractOCRParser class:
https://github.com/ThejanW/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java

On Tue, Mar 7, 2017 at 1:17 PM, Thejan Wijesinghe <
thejan.k.wijesinghe@gmail.com> wrote:

> Thamme,
> I have already used the Tess4j API to rewrite the TesseractOCRParser class.
> Although it successfully extracts content from most file types, it
> fails some particular unit tests in the TesseractOCRParserTest class. I can
> solve that. However, I want to know whether I can rewrite the entire
> TesseractOCRParser class from the ground up. If I do that, there will be
> many broken links in the internals of TIKA because, as far as I can see,
> most of the classes use the TesseractOCRParser class indirectly.
>
> On Mon, Mar 6, 2017 at 12:37 AM, Thamme Gowda <thammegowda@apache.org>
> wrote:
>
>> Thejan,
>>
>> Welcome to the world of mysteries. I am unable to explain why you are
>> facing it since I am unable to reproduce it.
>>
>> Try out a few other images; maybe the image you have chosen is corrupt, and
>> maybe there is an exception thrown and silently swallowed in the code.
>>
>> I suggest you do this:
>>    Please use an IDE like IntelliJ/Eclipse and use a debugger to understand
>> the call stack inside TesseractOCRParser. It is indeed a nice way to get to
>> the internals of Tika :-)
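
To illustrate the "silently swallowed" case mentioned above, here is a minimal Java sketch. It does not use Tika at all: runOcr, extractQuietly, and extractOrFail are hypothetical names invented for this example, and the "corrupt" file name check merely simulates a decoding failure. It only shows why a caller can end up with empty output and no visible error, and how rethrowing with the cause makes the failure show up in a debugger or test log.

```java
import java.io.IOException;

public class SwallowedExceptionDemo {

    // Hypothetical stand-in for an OCR step; fails for a simulated corrupt image.
    static String runOcr(String imageName) throws IOException {
        if (imageName.contains("corrupt")) {
            throw new IOException("cannot decode " + imageName);
        }
        return "recognized text from " + imageName;
    }

    // Swallows the failure: the caller only ever sees an empty result.
    static String extractQuietly(String imageName) {
        try {
            return runOcr(imageName);
        } catch (IOException e) {
            return ""; // the error is lost here
        }
    }

    // Surfaces the failure and keeps the original exception as the cause.
    static String extractOrFail(String imageName) {
        try {
            return runOcr(imageName);
        } catch (IOException e) {
            throw new IllegalStateException("OCR failed for " + imageName, e);
        }
    }

    public static void main(String[] args) {
        // Empty output, no hint that anything went wrong.
        System.out.println("quiet: [" + extractQuietly("corrupt.jpg") + "]");
        try {
            extractOrFail("corrupt.jpg");
        } catch (IllegalStateException e) {
            // The wrapped cause explains what actually happened.
            System.out.println("loud: " + e.getMessage() + " <- " + e.getCause().getMessage());
        }
    }
}
```

Setting a breakpoint inside a catch block like the one in extractQuietly is one way to find out whether something similar is happening in the real parser.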
>>
>>
>> Best,
>> TG
>>
>>
>> *--*
>> *Thamme Gowda*
>> TG | @thammegowda <https://twitter.com/thammegowda>
>> ~Sent via somebody's Webmail server!
>>
>> On Sat, Mar 4, 2017 at 9:04 AM, Thejan Wijesinghe <
>> thejan.k.wijesinghe@gmail.com> wrote:
>>
>> >
>> > Hi Thamme,
>> >
>> > Yes, I am using Ubuntu :) and I had both ImageMagick and Tesseract
>> > installed on my system using apt-get. Since I wasn't sure whether this is
>> > a problem with the APT software packages, I built both ImageMagick and
>> > Tesseract from source.
>> >
>> > I also double-checked the availability of Tesseract and ImageMagick by
>> > running the CLI commands that you suggested, as well as the commands below:
>> >
>> > convert test.jpg -resize 64x64 resized_test.jpg
>> >
>> > tesseract test.jpg out
>> >
>> > and they worked.
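
The same availability check can be approximated from Java, which may help when the shell and the JVM see different PATH settings. The sketch below is an assumption-laden illustration, not Tika's own detection logic: ToolCheck and isAvailable are hypothetical names, the 10-second timeout is arbitrary, and "available" here only means the process could be started and finished in time. It requires Java 9 or later for Redirect.DISCARD.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class ToolCheck {

    // Returns true if the command could be started and finished within the timeout.
    // The exit code is deliberately ignored; only "can it be launched" is tested.
    static boolean isAvailable(String... command) {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true);                        // merge stderr into stdout
        pb.redirectOutput(ProcessBuilder.Redirect.DISCARD);  // drop the output (Java 9+)
        try {
            Process p = pb.start();
            if (!p.waitFor(10, TimeUnit.SECONDS)) {
                p.destroyForcibly();                         // hung process counts as unavailable
                return false;
            }
            return true;
        } catch (IOException e) {
            return false;                                    // binary not found or not executable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();              // preserve the interrupt flag
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("tesseract: " + isAvailable("tesseract", "--version"));
        System.out.println("convert:   " + isAvailable("convert", "-version"));
    }
}
```

If this reports false for a tool that works in the terminal, the JVM is probably running with a different PATH than the shell.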
>> >
>> > I can't find an exact reason why I am not getting metadata, but when I
>> > used the AutoDetectParser class instead of the TesseractOCRParser class,
>> > I could extract both content and metadata.
>> >
>> > P.S. I will put updating the wiki OCR page on my TODO list :)
>> >
>>
>
>
>
> --
>
> Thejan Wijesinghe
>
> Department of Computer Science and Engineering
>
> University of Moratuwa
>
