tika-dev mailing list archives

From "Mattmann, Chris A (3980)" <chris.a.mattm...@jpl.nasa.gov>
Subject Re: OCR with tika-server
Date Wed, 01 Oct 2014 21:13:22 GMT
What type of image is it, Kevin?

If it’s a TIFF, you need to install tesseract with special libtiff
parameters. See:

https://gist.github.com/henrik/1967035


Can you parse the image file with tesseract by itself, without
Tika’s tmp image?
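
A minimal sketch of that standalone check, assuming a copy of the original
image is at hand. The names ocr_check, scan.tiff and out are placeholders
for illustration and do not come from this thread; the only tesseract usage
assumed is the basic "tesseract <image> <outputbase>" form, which writes the
recognized text to <outputbase>.txt.

```shell
#!/bin/sh
# Sketch: run tesseract directly on an image, bypassing Tika entirely.
# ocr_check is a hypothetical helper, not part of Tika or tesseract.
ocr_check() {
    image=$1
    outbase=$2
    # Fail early if the shell cannot find tesseract on PATH.
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract not found on PATH" >&2
        return 127
    fi
    # Fail early if the image is missing or unreadable.
    if [ ! -r "$image" ]; then
        echo "cannot read $image" >&2
        return 2
    fi
    # tesseract writes the recognized text to "$outbase.txt".
    tesseract "$image" "$outbase" && cat "$outbase.txt"
}

# Usage (placeholder file names): ocr_check scan.tiff out
```

If this works on the original file but fails on Tika's tmp copy, the problem
is more likely in how the tmp file is handed to tesseract than in the
tesseract install itself.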

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: "Ramirez, Paul M (398J)" <paul.m.ramirez@jpl.nasa.gov>
Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
Date: Wednesday, October 1, 2014 at 1:47 PM
To: "dev@tika.apache.org" <dev@tika.apache.org>
Subject: Re: OCR with tika-server

>Nothing to be embarrassed about at all, Kevin. I actually thought maybe it
>was just a typo and I happened to catch it by chance. I've
>definitely done that one before myself.
>
>Bummed that was not the problem.
>
>--Paul
>
>On Oct 1, 2014, at 1:00 PM, kevin slote <kslote1@gmail.com> wrote:
>
>> What I wrote there did have a typo in it. (It's not every day you get to
>> embarrass yourself in front of a bunch of guys from NASA)
>> 
>> But that was not what I had in my terminal when I tested it.
>> 
>> 
>> 
>> The actual PATH was:
>> 
>> 
>> 
>> 
>> 
>> "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/bin/tesseract"
>> 
>> 
>> 
>> I think what was actually wrong with the path is that I added the entire
>> path to the tesseract executable, which was in my /usr/bin/ directory,
>> instead of just the directory where tesseract lives.  Is this true?
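
Each PATH entry is searched as a directory, so an entry that names the
executable itself never matches. A small sketch of that lookup, assuming a
throwaway directory with a fake "tesseract" script; find_in_path is a
hypothetical helper written for this illustration, not something from Tika
or tesseract.

```shell
#!/bin/sh
# Sketch: resolve a command the way a shell does, by treating each
# colon-separated PATH entry as a directory that may contain the command.
find_in_path() {
    cmd=$1
    search_path=$2
    old_ifs=$IFS
    IFS=:
    for dir in $search_path; do
        if [ -f "$dir/$cmd" ] && [ -x "$dir/$cmd" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir/$cmd"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Demo with a temporary directory holding a fake "tesseract" executable.
demo_dir=$(mktemp -d)
printf '#!/bin/sh\necho fake\n' > "$demo_dir/tesseract"
chmod +x "$demo_dir/tesseract"

# Entry is the directory: prints the full path to the fake executable.
find_in_path tesseract "/nonexistent:$demo_dir"

# Entry is the executable itself: nothing matches, because
# "$demo_dir/tesseract/tesseract" does not exist.
find_in_path tesseract "/nonexistent:$demo_dir/tesseract" || echo "not found"

rm -rf "$demo_dir"
```

With /usr/bin already on the PATH, the extra /usr/bin/tesseract entry is
simply ignored, and tesseract is still found through /usr/bin.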
>> 
>> 
>> 
>> I deleted the hard coding from TesseractOCRConfig.java and then printed
>> config.getTesseractPath() to stdout.  This field was empty.
>> 
>> However, I have tesseract installed system wide on this ubuntu vm.
>> 
>> So the canRun method evaluated as true whether or not the tesseractPath was
>> configured correctly.
>> 
>> 
>> 
>> I have been slowly trying to debug this all day.  It looks like tika is
>> making a tmp file with the .tmp suffix.
>> 
>> I commented out some of the code so that the tmp files remained in /tmp/.
>> 
>> 
>> 
>> It looks like tesseract doesn't like that.
>> 
>> I tried to ocr these .tmp files to see if I could isolate what was going
>> wrong for me.
>> 
>> 
>> 
>> kslote@ubuntu:~/tika/tika$ tesseract /tmp/apache-tika-7112319184053570698.tmp out
>> 
>> Tesseract Open Source OCR Engine
>> 
>> name_to_image_type:Error:Unrecognized image type:/tmp/apache-tika-7112319184053570698.tmp
>> 
>> IMAGE::read_header:Error:Can't read this image type:/tmp/apache-tika-7112319184053570698.tmp
>> 
>> tesseract:Error:Read of file failed:/tmp/apache-tika-7112319184053570698.tmp
>> 
>> Segmentation fault
>> 
>> 
>> 
>> On the wiki it mentions something about getting tesseract to work with
>> .tiff files.  For whatever reason, the tesseract I have installed only
>> works for .tiff files.  Would you recommend that I reinstall tesseract
>> from source?
>> 
>> On Tue, Sep 30, 2014 at 7:28 PM, Ramirez, Paul M (398J) <paul.m.ramirez@jpl.nasa.gov> wrote:
>> 
>>> Is that a typo in your path to tesseract?
>>> 
>>> /urs/bin/tesseract => /usr/bin/tesseract
>>> 
>>> --Paul
>>> 
>>>> On Sep 30, 2014, at 1:48 PM, "kevin slote" <kslote1@gmail.com> wrote:
>>>> 
>>>> Unfortunately, that did not do it either.
>>>> 
>>>> I did:
>>>> 
>>>>  $ export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
>>>> 
>>>> Here is the output from printenv
>>>> 
>>>> kslote@ubuntu:~/tika/tika$ printenv
>>>> SHELL=/bin/bash
>>>> USERNAME=kslote
>>>> XDG_CONFIG_DIRS=/etc/xdg/xdg-gnome:/etc/xdg
>>>> DESKTOP_SESSION=gnome
>>>> PATH=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
>>>> PWD=/home/kslote/tika/tika
>>>> HOME=/home/kslote
>>>> LOGNAME=kslote
>>>> _=/usr/bin/printenv
>>>> 
>>>> 
>>>> On Tue, Sep 30, 2014 at 4:13 PM, Tyler Palsulich <tpalsulich@gmail.com> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> Hmm. Could you try adding tesseract to your PATH? How did you install
>>>>> Tesseract? You should be able to do a straightforward `sudo apt-get install
>>>>> tesseract-ocr`. After that, the OCR tests should pass. We're still running
>>>>> into TIKA-1422, where a mail test fails. But, you can run just the OCR
>>>>> tests with `mvn test -Dtest=org.apache.tika.parser.ocr.TesseractOCRTest
>>>>> -DfailIfNoTests=false`.
>>>>> 
>>>>> Let me know if that works for you!
>>>>> Tyler
>>>>> 
>>>>>> On Tue, Sep 30, 2014 at 4:00 PM, kevin slote <kslote1@gmail.com> wrote:
>>>>>> 
>>>>>> I am working on ubuntu 10.4 and I am having some trouble.
>>>>>> Tesseract is installed correctly, but just doing a clone from the repo and
>>>>>> installing with maven, I am getting some errors.
>>>>>> 
>>>>>> This is before I did anything with tesseract installed.
>>>>>> 
>>>>>> Failed tests:
>>>>>>   testPPTXOCR(org.apache.tika.parser.ocr.TesseractOCRTest): Check for the image's text.
>>>>>>   testDOCXOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
>>>>>>   testPDFOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
>>>>>> 
>>>>>> Next I hard coded the tesseractPath:
>>>>>> 
>>>>>> I went into TesseractOCRConfig.java and hard coded 'tesseractPath'.
>>>>>> Then all tests passed and it built successfully, but then I went to post
>>>>>> some TIFFs to the server.
>>>>>> That didn't work. So I tried adding some System.out.println("hello world")
>>>>>> (a little crude, I know) inside the unit tests to confirm that tesseract
>>>>>> was working correctly.  It looks like something happens in the unit test in
>>>>>> TesseractOCRTest.java
>>>>>> on the line that says TesseractOCRConfig config = new
>>>>>> TesseractOCRConfig();. Printing to stdout before works, but I get nothing
>>>>>> after. That happens before the assumeTrue(canRun(config));. So an exception
>>>>>> is not getting raised.
>>>>>> 
>>>>>> Then once everything is built, ocr does not work.  That was why I figured I
>>>>>> would ask to see if I missed some sort of configuration step in building
>>>>>> it.
>>>>>> 
>>>>>> Thanks a ton.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Tue, Sep 30, 2014 at 2:57 PM, Mattmann, Chris A (3980) <
>>>>>> chris.a.mattmann@jpl.nasa.gov> wrote:
>>>>>> 
>>>>>>> Dear Kevin,
>>>>>>> 
>>>>>>> Sure, it already works :) 1.7-SNAPSHOT.
>>>>>>> 
>>>>>>> See this wiki page:
>>>>>>> 
>>>>>>> https://wiki.apache.org/tika/TikaOCR
>>>>>>> 
>>>>>>> I'd be happy to discuss more.
>>>>>>> 
>>>>>>> Thanks!
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> Chris
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: kevin slote <kslote1@gmail.com>
>>>>>>> Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
>>>>>>> Date: Tuesday, September 30, 2014 at 8:52 AM
>>>>>>> To: "dev@tika.apache.org" <dev@tika.apache.org>
>>>>>>> Subject: OCR with tika-server
>>>>>>> 
>>>>>>>> Hello all,
>>>>>>>> 
>>>>>>>> I have been testing out the integration of tika with tesseract.
>>>>>>>> I was wondering if there is a way to get tika-server to run with
>>>>>>>> tesseract's OCR capabilities?
>>>>>>>> 
>>>>>>>> Best
>>>>>>>> 
>>>>>>>> Kevin Slote
>>>>> 
>>> 
>
