tika-dev mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: Tess4j API for TIKA OCR parser
Date Tue, 07 Mar 2017 15:43:46 GMT
Yes, and why not give the new tika-eval module a trial to evaluate the differences in output? :)

-----Original Message-----
From: Thamme Gowda [mailto:thammegowda@apache.org] 
Sent: Tuesday, March 7, 2017 10:38 AM
To: Thejan Wijesinghe <thejan.k.wijesinghe@gmail.com>
Cc: dev@tika.apache.org
Subject: Re: Tess4j API for TIKA OCR parser

Thanks Nick for the reply.

Thejan,

I am glad to hear about your progress. Rewriting the TesseractOCRParser would be the ultimate goal
if using Tess4j proves to be better than the current approach.

But, for now, please consider these:
+ Rename your class to *Tess4jOCRParser*. It is a new parser providing the
same functionality as *TesseractOCRParser*.
+ Keep the *TesseractOCRParser* intact. You can use it as your reference to
understand which features the OCR parser needs to support.
+ Benchmark *TesseractOCRParser* and *Tess4jOCRParser* with respect to
performance and stability. You can take a set of 100 images and compare how much
time each parser takes on them. Please share those results here.
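
A minimal sketch of how such a benchmark could be structured, assuming each parser can be wrapped as a function from an image path to extracted text. The class `OcrBenchmark`, its `time` method, and the two placeholder engines are hypothetical names for illustration; they are not part of Tika or Tess4j, and the real parsers would be plugged in where the lambdas are. Time is measured per engine over the same inputs, and exceptions are counted as a rough stability signal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class OcrBenchmark {

    /** Timing summary for one engine over one set of inputs. */
    static final class Result {
        final String name;
        final int count;
        final int failures;
        final long totalNanos;

        Result(String name, int count, int failures, long totalNanos) {
            this.name = name;
            this.count = count;
            this.failures = failures;
            this.totalNanos = totalNanos;
        }

        /** Mean wall-clock time per input, in milliseconds. */
        double meanMillis() {
            return count == 0 ? 0.0 : totalNanos / 1_000_000.0 / count;
        }
    }

    /** Runs the engine once per input, summing elapsed time and counting exceptions. */
    static <T> Result time(String name, List<T> inputs, Function<T, String> engine) {
        long total = 0;
        int failures = 0;
        for (T input : inputs) {
            long start = System.nanoTime();
            try {
                engine.apply(input);
            } catch (RuntimeException e) {
                failures++; // a failed run still counts toward elapsed time
            }
            total += System.nanoTime() - start;
        }
        return new Result(name, inputs.size(), failures, total);
    }

    public static void main(String[] args) {
        // Hypothetical file names standing in for a set of 100 test images.
        List<String> images = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            images.add("image-" + i + ".png");
        }

        // Placeholder engines: replace these with calls into the two real parsers.
        Function<String, String> existingParser = path -> "text from " + path;
        Function<String, String> newParser = path -> {
            if (path.endsWith("7.png")) {
                throw new IllegalStateException("simulated failure on " + path);
            }
            return "text from " + path;
        };

        for (Result r : List.of(
                time("TesseractOCRParser (placeholder)", images, existingParser),
                time("Tess4jOCRParser (placeholder)", images, newParser))) {
            System.out.printf("%s: %d images, %d failures, %.3f ms/image%n",
                    r.name, r.count, r.failures, r.meanMillis());
        }
    }
}
```

For numbers worth sharing, it would probably help to run each engine a few times on the same images and discard the first pass, since JVM warm-up and native library loading can dominate a single run.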


Based on the benchmark, we can decide whether to replace the old parser with the new one. Because TesseractOCRParser
is used along with many other parsers (JPEG, PDF, etc.), any improvements you make with Tess4jOCRParser
will have a huge effect!

P.S.
+ Please don't edit any test cases. You may add new ones, though!
+ Could you please create a Jira issue to track this? Sorry, I should have
said this earlier.

Best,
TG


On Tue, Mar 7, 2017 at 4:58 AM, Thejan Wijesinghe < thejan.k.wijesinghe@gmail.com> wrote:

> Hi Nick,
>
> I thought the same thing. I will try to keep the public method 
> signatures unchanged and will send updates on my progress.
>
> On Tue, Mar 7, 2017 at 5:48 PM, Nick Burch <apache@gagravarr.org> wrote:
>
> > On Tue, 7 Mar 2017, Thejan Wijesinghe wrote:
> >
> >> I have already used the Tess4j API to rewrite the TesseractOCRParser
> >> class. Although it successfully extracts content from most of the
> >> file types, it fails some particular unit tests in the
> >> TesseractOCRParserTest class. I can solve that. However, I want to
> >> know whether I can rewrite the entire TesseractOCRParser class from
> >> the ground up. If I do that, there will be many broken links in the
> >> internals of TIKA because, as I witnessed, most of the classes use
> >> the TesseractOCRParser class indirectly.
> >>
> >
> > If you can, try to keep the public methods unchanged. That way, 
> > other callers to the class will be unaffected by your re-write of 
> > > the internal logic.
> >
> > Nick
> >
>
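
The advice quoted above (keep the public methods unchanged and rewrite only the internals) can be sketched as follows. This is an illustration under stated assumptions, not Tika's actual class layout: `OcrFacade`, `extractText`, `legacyEngine`, and `rewrittenEngine` are hypothetical names, and the engines return dummy strings instead of running OCR.

```java
import java.util.function.Function;

public class OcrFacade {

    // The internal engine is the only part that changes in a rewrite.
    private final Function<byte[], String> engine;

    /** Existing public constructor: callers keep using it unchanged. */
    public OcrFacade() {
        this(OcrFacade::legacyEngine);
    }

    /** Package-private constructor used to swap in a different engine. */
    OcrFacade(Function<byte[], String> engine) {
        this.engine = engine;
    }

    /** Public method whose signature stays the same across the rewrite. */
    public String extractText(byte[] image) {
        return engine.apply(image);
    }

    // Hypothetical stand-in for the current implementation.
    static String legacyEngine(byte[] image) {
        return "legacy:" + image.length;
    }

    // Hypothetical stand-in for a rewritten implementation.
    static String rewrittenEngine(byte[] image) {
        return "rewritten:" + image.length;
    }

    public static void main(String[] args) {
        byte[] fakeImage = new byte[3];
        // Both objects expose the same public method; only the internals differ.
        System.out.println(new OcrFacade().extractText(fakeImage));
        System.out.println(new OcrFacade(OcrFacade::rewrittenEngine).extractText(fakeImage));
    }
}
```

Because other classes depend only on the public method, they compile and behave the same way regardless of which engine sits behind it.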