tika-dev mailing list archives

From "Albert Law (Logik)" <alb...@logik.com>
Subject Re: Tesseract OCR engine
Date Wed, 30 Nov 2011 21:48:40 GMT
Hi Chris,

I agree with Oleg.  Tesseract is free but requires training to produce
respectable OCR output.  I also found that Tesseract had memory
leaks (circa Sept. 2010).

Aside: I noticed Tesseract has neither pre-compiled builds nor a Java API.
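
Without a Java API, one possible workaround is to shell out to the
command-line program. The following is only a rough sketch: it assumes a
`tesseract` binary is on the PATH and is invoked as
`tesseract <image> <outputbase> [-l <lang>]`. The class and method names
(TesseractCli, buildCommand, run) are hypothetical and exist only for
illustration.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika or Tesseract.
class TesseractCli {

    // Builds the argument list: tesseract <image> <outputBase> [-l <lang>]
    static List<String> buildCommand(Path image, Path outputBase, String lang) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract"); // assumes the binary is on the PATH
        cmd.add(image.toString());
        cmd.add(outputBase.toString());
        if (lang != null && !lang.isEmpty()) {
            cmd.add("-l");
            cmd.add(lang);
        }
        return cmd;
    }

    // Runs the external process and returns its exit code.
    // The recognized text is written by tesseract to <outputBase>.txt.
    static int run(Path image, Path outputBase, String lang)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(buildCommand(image, outputBase, lang))
                .inheritIO() // forward the tool's stdout/stderr to our own
                .start();
        return p.waitFor();
    }
}
```

Running the engine in a separate process would also keep any memory
leaks out of the JVM, at the cost of process start-up time per image.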

On Wed, Nov 30, 2011 at 9:51 AM, Mattmann, Chris A (388J)
<chris.a.mattmann@jpl.nasa.gov> wrote:
> Hi Oleg,
>
> Thanks for the FYI, Oleg, and for the heads-up on what needs to improve
> here.
>
> Cheers,
> Chris
>
> On Nov 29, 2011, at 11:10 PM, Oleg Tikhonov wrote:
>
>> Hi Chris,
>> I was playing with it recently.
>> One of the big issues with Tesseract is the tough process of preparing a
>> training set for multiple fonts and languages.
>> In addition, we would also have to add an option for image preprocessing
>> (deskewing, filtering, etc.).
>>
>>
>> BR,
>> Oleg
>>
>> On Wed, Nov 30, 2011 at 8:59 AM, Mattmann, Chris A (388J) <
>> chris.a.mattmann@jpl.nasa.gov> wrote:
>>
>>> Hey Guys,
>>>
>>> FYI: http://code.google.com/p/tesseract-ocr/
>>>
>>> I was pointed at this library by someone recently asking me if Tika
>>> was interested in integrating with this library. It's ALv2 licensed, and
>>> seems pretty interesting. I'm going to check it out, but just
>>> wanted to give everyone a heads up.
>>>
>>> Cheers,
>>> Chris
>>>
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Chris Mattmann, Ph.D.
>>> Senior Computer Scientist
>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>> Office: 171-266B, Mailstop: 171-246
>>> Email: chris.a.mattmann@nasa.gov
>>> WWW:   http://sunset.usc.edu/~mattmann/
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Adjunct Assistant Professor, Computer Science Department
>>> University of Southern California, Los Angeles, CA 90089 USA
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>
>>>



-- 

Sincerely,
Albert Law
Senior Software Engineer
Logik.com
