tika-dev mailing list archives

From Luís Filipe Nassif <lfcnas...@gmail.com>
Subject Re: Tess4j API for TIKA OCR parser
Date Tue, 07 Mar 2017 15:33:37 GMT
Hi Thejan,

Before the first version of TesseractOCRParser was committed, about four
years ago, I tried to use Tess4j. Unfortunately, at that time I ran into
problems such as permanent hangs in tesseract/Tess4j and, even worse, JVM
crashes caused by bugs in the native code (pointers to invalid addresses)
when processing corrupted images. So I changed strategy and used
Runtime.exec to run tesseract out of process, which got rid of those JVM
crashes.
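
To illustrate the out-of-process idea, here is a minimal sketch. It is not
the actual TesseractOCRParser code. The class and method names
(OutOfProcessOcr, buildCommand, runWithTimeout) and the 120-second timeout
are assumptions made up for this example, and it uses ProcessBuilder rather
than Runtime.exec directly. The point is that a crash or hang in native
code only affects the child process, which the JVM can kill after a timeout.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, for illustration only (not part of Tika).
public class OutOfProcessOcr {

    // Builds the command line: tesseract <image> <outputbase> -l <lang>
    static List<String> buildCommand(Path image, Path outputBase, String lang) {
        return List.of("tesseract", image.toString(), outputBase.toString(), "-l", lang);
    }

    // Runs the command in a child process and returns its exit code.
    // If the child does not finish in time, it is killed and a
    // TimeoutException is thrown, so a hung tesseract cannot block the JVM.
    static int runWithTimeout(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException, TimeoutException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true);
        // Discard child output so a full pipe buffer cannot stall the child.
        pb.redirectOutput(ProcessBuilder.Redirect.DISCARD);
        Process process = pb.start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly();
            process.waitFor();
            throw new TimeoutException("command timed out: " + command);
        }
        return process.exitValue();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: OutOfProcessOcr <image> <outputbase>");
            return;
        }
        List<String> command =
                buildCommand(Paths.get(args[0]), Paths.get(args[1]), "eng");
        // Assumed timeout of 120 seconds; tune for real workloads.
        int exitCode = runWithTimeout(command, 120);
        System.out.println("tesseract exit code: " + exitCode);
    }
}
```

With this layout, a corrupted image can at worst produce a non-zero exit
code or a timeout in the child process, while the calling JVM keeps running.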

That was a long time ago, and those problems may be gone with current
tesseract and Tess4j. But for now I recommend committing your changes as a
new parser instead of changing the default TesseractOCRParser, until the
new code has been tested with tika-batch against millions of images from
the wild and proven stable enough to be the default OCR parser of Tika.

Best,
Luis

On Mar 7, 2017 9:58 AM, "Thejan Wijesinghe" <
thejan.k.wijesinghe@gmail.com> wrote:

> Hi Nick,
>
> I thought the same thing. I will try to keep the public method signatures
> unchanged and will send updates on my progress.
>
> On Tue, Mar 7, 2017 at 5:48 PM, Nick Burch <apache@gagravarr.org> wrote:
>
> > On Tue, 7 Mar 2017, Thejan Wijesinghe wrote:
> >
> > >> I have already used the Tess4j API to rewrite the TesseractOCRParser
> > >> class. Although it successfully extracts content from most file types,
> > >> it fails some particular unit tests in the TesseractOCRParserTest
> > >> class. I can solve that. However, I want to know whether I can rewrite
> > >> the entire TesseractOCRParser class from the ground up. If I do that,
> > >> there will be many broken links in the internals of TIKA because, as I
> > >> witnessed, most of the classes use the TesseractOCRParser class
> > >> indirectly.
> >>
> >
> > If you can, try to keep the public methods unchanged. That way, other
> > callers of the class will be unaffected by your rewrite of the internal
> > logic.
> >
> > Nick
> >
>
