tika-dev mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: Tess4j API for TIKA OCR parser
Date Tue, 07 Mar 2017 15:44:06 GMT

Same experience, of same vintage. :)

-----Original Message-----
From: Luís Filipe Nassif [mailto:lfcnassif@gmail.com] 
Sent: Tuesday, March 7, 2017 10:34 AM
To: dev@tika.apache.org
Subject: Re: Tess4j API for TIKA OCR parser

Hi Thejan,

Before the first version of TesseractOCRParser was committed, about 4 years ago, I tried to
use Tess4j. Unfortunately, at that time I ran into problems such as permanent hangs in
tesseract/Tess4j and, even worse, JVM crashes caused by bugs in the native code (pointers to
invalid addresses) when processing corrupted images. So I changed strategy and took the
Runtime.exec route, executing tesseract out of process to get rid of those JVM crashes.
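
As a rough illustration of the out-of-process idea (not the actual TesseractOCRParser code), here
is a minimal sketch. The class name OutOfProcessRunner and its run method are hypothetical. It
uses ProcessBuilder rather than Runtime.exec directly and assumes Java 9 or later. A crash in the
child only produces a non-zero exit code, and a hang is cut off by the timeout.

```java
import java.io.IOException;
import java.util.List;
import java.util.OptionalInt;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, for illustration only: runs an external command in a
// child process so that a native crash or hang cannot take down the calling JVM.
public class OutOfProcessRunner {

    /**
     * Runs the command and waits up to the given timeout.
     * Returns the exit code, or an empty OptionalInt if the process
     * had to be killed because it did not finish in time.
     */
    public static OptionalInt run(List<String> command, long timeout, TimeUnit unit)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Merge stderr into stdout and discard both, so the child can never
        // block on a full pipe buffer (Redirect.DISCARD needs Java 9+).
        builder.redirectErrorStream(true);
        builder.redirectOutput(ProcessBuilder.Redirect.DISCARD);

        Process process = builder.start();
        if (!process.waitFor(timeout, unit)) {
            // The child hung: kill it and report the timeout to the caller.
            process.destroyForcibly();
            process.waitFor();
            return OptionalInt.empty();
        }
        // A crashed child shows up here as a non-zero exit code.
        return OptionalInt.of(process.exitValue());
    }
}
```

A call might look like OutOfProcessRunner.run(List.of("tesseract", "page.png", "page"), 120,
TimeUnit.SECONDS), where page.png is a made-up input image and tesseract is assumed to be on the
PATH. With that invocation tesseract writes its text to page.txt, which the caller reads afterwards.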

That was a long time ago; maybe those problems are gone with current tesseract and Tess4j.
But for now I recommend committing your changes as a new parser instead of changing the default
TesseractOCRParser, until the new code has been tested against millions of images from the wild
with tika-batch and proven stable enough to be the default OCR parser of Tika.


On Mar 7, 2017 9:58 AM, "Thejan Wijesinghe" <thejan.k.wijesinghe@gmail.com> wrote:

> Hi Nick,
> I thought the same thing. I will try to keep the public method 
> signatures unchanged and will send updates on my progress.
> On Tue, Mar 7, 2017 at 5:48 PM, Nick Burch <apache@gagravarr.org> wrote:
> > On Tue, 7 Mar 2017, Thejan Wijesinghe wrote:
> >
> >> I have already used the Tess4j API to rewrite the TesseractOCRParser
> >> class. Although it successfully extracts content from most of the
> >> file types, it fails some particular unit tests in the
> >> TesseractOCRParserTest class. I can solve that. However, I want to
> >> know whether I can rewrite the entire TesseractOCRParser class from
> >> the ground up, but if I do that there will be many broken links in
> >> the internals of TIKA because, as I witnessed, most of the classes
> >> use the TesseractOCRParser class indirectly.
> >>
> >
> > If you can, try to keep the public methods unchanged. That way, 
> > other callers to the class will be unaffected by your re-write of 
> > the internal logic
> >
> > Nick
> >