tika-dev mailing list archives

From Thamme Gowda <thammego...@apache.org>
Subject Re: Tess4j API for TIKA OCR parser
Date Tue, 07 Mar 2017 15:37:36 GMT
Thanks Nick for the reply.


I am glad to hear about your progress. Rewriting the TesseractOCRParser would
be the ultimate goal if using Tess4j proves to be better than the current
approach.

But, for now, please consider these:
+ Rename your class to *Tess4jOCRParser*. It is a new parser providing the
same functionality as *TesseractOCRParser*.
+ Keep the *TesseractOCRParser* intact. You can use it as a reference to
understand which features the OCR parser needs to support.
+ Benchmark *TesseractOCRParser* and *Tess4jOCRParser* with respect to
performance and stability. You can take a set of 100 images and compare how
much time each parser takes on them. Please share those results here.
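
A minimal sketch of how such a comparison could be timed, in case it helps.
Everything here is an assumption for illustration: the class name
OcrBenchmark, the OcrEngine adapter interface, and the Result fields are
hypothetical and are not part of Tika or Tess4j. Each parser would be wrapped
behind OcrEngine, and the harness records total time and the number of images
that threw an exception, which is a rough proxy for stability.

```java
import java.nio.file.Path;
import java.util.List;

/** Hypothetical harness: times one OCR back end over a fixed image set. */
final class OcrBenchmark {

    /** Hypothetical adapter; wrap each parser's parse call behind this. */
    interface OcrEngine {
        String recognize(Path image) throws Exception;
    }

    /** Totals for one engine over the whole image set. */
    static final class Result {
        final String name;
        final long totalNanos;
        final int failures;
        final int images;

        Result(String name, long totalNanos, int failures, int images) {
            this.name = name;
            this.totalNanos = totalNanos;
            this.failures = failures;
            this.images = images;
        }

        /** Mean wall-clock time per image in milliseconds. */
        double meanMillis() {
            return images == 0 ? 0.0 : totalNanos / 1_000_000.0 / images;
        }
    }

    /** Runs the engine once per image and accumulates time and failures. */
    static Result run(String name, OcrEngine engine, List<Path> images) {
        long total = 0;
        int failures = 0;
        for (Path image : images) {
            long start = System.nanoTime();
            try {
                engine.recognize(image);
            } catch (Exception e) {
                failures++; // count the failure and continue with the next image
            }
            total += System.nanoTime() - start;
        }
        return new Result(name, total, failures, images.size());
    }
}
```

Running both engines over the same list of images, ideally after one
unmeasured warm-up pass each, gives two Result objects whose meanMillis() and
failures values can be posted side by side.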

Based on the benchmark, we can decide whether to replace the old parser with
the new one. Because TesseractOCRParser is used along with many other parsers
(JPEG, PDF, etc.), any improvements you make with Tess4jOCRParser will have a
huge effect!

+ Please don't edit any test cases. You may add new ones, though!
+ Could you please create a Jira issue to track this? Sorry, I should have
said this earlier.


On Tue, Mar 7, 2017 at 4:58 AM, Thejan Wijesinghe <
thejan.k.wijesinghe@gmail.com> wrote:

> Hi Nick,
> I thought the same thing. I will try to keep the public method signatures
> unchanged and will send updates on my progress.
> On Tue, Mar 7, 2017 at 5:48 PM, Nick Burch <apache@gagravarr.org> wrote:
> > On Tue, 7 Mar 2017, Thejan Wijesinghe wrote:
> >
> >> I have already used the Tess4j API to rewrite the TesseractOCRParser
> >> class. Although it successfully extracts content from most of the file
> >> types, it fails some particular unit tests in the TesseractOCRParserTest
> >> class. I can solve that. However, I want to know whether I can rewrite
> >> the entire TesseractOCRParser class from the ground up, but if I do that
> >> there will be many broken links in the internals of TIKA because, as I
> >> witnessed, most of the classes use the TesseractOCRParser class
> >> indirectly.
> >>
> >
> > If you can, try to keep the public methods unchanged. That way, other
> > callers to the class will be unaffected by your re-write of the internal
> > logic
> >
> > Nick
> >
