tika-dev mailing list archives

From kevin slote <kslo...@gmail.com>
Subject Re: OCR with tika-server
Date Fri, 03 Oct 2014 23:10:59 GMT
Hi all,

I just confirmed that the problem was that my version of tesseract was too
old.
Maybe it would be a good idea to add a check to the canRun method at the
top of the tesseract unit test that also verifies the installed version of
tesseract is recent enough?

Older versions of tesseract do not have a "-v" or "--version" flag.  So
maybe use ProcessBuilder to run that command and parse the string to see if
it returned an error?
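
A rough sketch of what such a check could look like, using ProcessBuilder.
This is not code from Tika; the class and method names are made up for
illustration, and the pattern assumes a newer tesseract answers "-v" with a
line such as "tesseract 3.02.02", which I have not checked against every
release. It needs Java 9 or later for readAllBytes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika.
final class TesseractVersionProbe {

    // Assumed output shape: lowercase "tesseract" followed by a dotted version.
    private static final Pattern VERSION =
            Pattern.compile("tesseract\\s+(\\d+(?:\\.\\d+)*)");

    /** Returns the version token if the captured output contains one. */
    static Optional<String> parseVersion(String output) {
        Matcher m = VERSION.matcher(output);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Runs "<executable> -v"; empty means no usable version was reported. */
    static Optional<String> probe(String executable) {
        ProcessBuilder pb = new ProcessBuilder(executable, "-v");
        pb.redirectErrorStream(true); // read stderr and stdout as one stream
        try {
            Process p = pb.start();
            p.getOutputStream().close(); // the child gets no stdin
            String out = new String(p.getInputStream().readAllBytes(),
                    StandardCharsets.UTF_8);
            p.waitFor();
            return parseVersion(out);
        } catch (IOException e) {
            // Executable missing or not runnable.
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(probe("tesseract").orElse("no version reported"));
    }
}
```

An old tesseract that does not understand "-v" would presumably print its
usual error text instead, so parseVersion finds nothing and the test could
be skipped instead of failing.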

Thanks for everyone's help.

On Fri, Oct 3, 2014 at 2:30 PM, kevin slote <kslote1@gmail.com> wrote:

> Thanks for following up!
>
> I was trying to dig deeper before I responded.
>
> Tyler,
>
> I followed those instructions.  My version of Tesseract does not ocr the
> google logo because it is not a tiff.  I used imagemagick to convert it to
> a tif and tesseract returned "check_legal_image_size:Error:Only 1,2,4,5,6,8
> bpp are supported:32" error which usually means it needs to be re-sized
> with imagemagick.
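
The error text itself reports a bits-per-pixel value of 32 and lists 1, 2,
4, 5, 6 and 8 as accepted. A small Java sketch for checking an image's bit
depth and redrawing it as 8-bit grayscale before handing it to tesseract
might look like this. The class name is made up, the set of depths is taken
only from that error message, and this is an alternative to the imagemagick
step rather than anything Tika does.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.util.Set;

// Hypothetical helper, not part of Tika or tesseract.
final class BitDepthCheck {

    // Depths named in the check_legal_image_size error message.
    private static final Set<Integer> SUPPORTED_BPP = Set.of(1, 2, 4, 5, 6, 8);

    /** Bits used to store one pixel according to the image's color model. */
    static int bitsPerPixel(BufferedImage image) {
        return image.getColorModel().getPixelSize();
    }

    static boolean isSupported(BufferedImage image) {
        return SUPPORTED_BPP.contains(bitsPerPixel(image));
    }

    /** Redraws the source into a new 8-bit grayscale image of the same size. */
    static BufferedImage toGray8(BufferedImage source) {
        BufferedImage gray = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            g.drawImage(source, 0, 0, null);
        } finally {
            g.dispose();
        }
        return gray;
    }

    public static void main(String[] args) {
        BufferedImage argb = new BufferedImage(4, 4, BufferedImage.TYPE_INT_ARGB);
        System.out.println(bitsPerPixel(argb) + " -> " + bitsPerPixel(toGray8(argb)));
    }
}
```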
>
>
> Chris,
>
> I wrote a python wrapper for tesseract that can parse the documents that
> were in your test-document repository concerning OCR (testOCR.pdf, etc.) It
> looks like right now, in TesseractOCRParser.java, the command line argument
> that is passed to the os points to a .tmp file in /tmp/.
>
> So the command that is executed is
>
>    "tesseract /tmp/apache-tika-2409864150710514587.tmp
> /tmp/apache-tika-1277985370508249503.tmp -l eng -psm 1"
>
> This is not working for me.  When I grab those .tmp files and try to ocr
> them from the command line, tesseract gets thrown for a loop.
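
For anyone who wants to reproduce this outside of Tika, here is a minimal
sketch that builds the same argument list as the command above. It is not
the code in TesseractOCRParser.java; the class and method names are invented
for illustration, and the flags are simply the ones shown in that command.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper that mirrors the command line quoted above.
final class TesseractCommand {

    /** One list element per argument, so no shell quoting is involved. */
    static List<String> build(String executable, Path input, Path outputBase,
                              String language, int pageSegMode) {
        return List.of(
                executable,
                input.toString(),
                outputBase.toString(),
                "-l", language,
                "-psm", Integer.toString(pageSegMode));
    }

    public static void main(String[] args) {
        List<String> cmd = build("tesseract",
                Paths.get("input.tmp"), Paths.get("output.tmp"), "eng", 1);
        // Print the command instead of running it, so it can be pasted
        // into a terminal and tried by hand.
        System.out.println(String.join(" ", cmd));
    }
}
```

The list can be passed to new ProcessBuilder(cmd), with inheritIO() set so
that tesseract's own error messages show up in the terminal.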
>
> From what I can tell, the tesseract I have installed can only handle
> .tif files.
> I can back this up by citing the tesseract page:
> https://code.google.com/p/tesseract-ocr/wiki/ReadMe
>
>  If Tesseract isn't available for your distribution, or you want to use a
> newer version than they offer, you can compile your own
> <https://code.google.com/p/tesseract-ocr/wiki/Compiling>. Note that  older
> versions of Tesseract only supported processing .tiff files.
>
> So, I think that upgrading tesseract or moving to ubuntu 12 or higher will
> solve my problems.
>
> I will let the listserv know if that fixes it.
>
>
> Kevin Slote
>
>
>
> On Wed, Oct 1, 2014 at 5:13 PM, Mattmann, Chris A (3980) <
> chris.a.mattmann@jpl.nasa.gov> wrote:
>
>> What type of image is it, Kevin?
>>
>> If it’s a TIFF, you need to install tesseract with special lib tiff
>> parameters. See:
>>
>> https://gist.github.com/henrik/1967035
>>
>>
>> Can you parse the image file with tesseract by itself, without
>> Tika’s tmp image?
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: chris.a.mattmann@nasa.gov
>> WWW:  http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>>
>>
>>
>>
>>
>> -----Original Message-----
>> From: <Ramirez>, "Paul M   (398J)" <paul.m.ramirez@jpl.nasa.gov>
>> Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
>> Date: Wednesday, October 1, 2014 at 1:47 PM
>> To: "<dev@tika.apache.org>" <dev@tika.apache.org>
>> Subject: Re: OCR with tika-server
>>
>> >Nothing to be embarrassed about at all Kevin. I actually thought maybe it
>> >was just a typo issue and I randomly happen to catch that. I've
>> >definitely done that one before myself.
>> >
>> >Bummed that was not the problem.
>> >
>> >--Paul
>> >
>> >On Oct 1, 2014, at 1:00 PM, kevin slote <kslote1@gmail.com>
>> > wrote:
>> >
>> >> What I wrote there did have a typo in it. (It's not every day you get to
>> >> embarrass yourself in front of a bunch of guys from NASA)
>> >>
>> >> But that was not what I had in my terminal when I tested it.
>> >>
>> >>
>> >>
>> >> The actual PATH was:
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/bin/tesseract"
>> >>
>> >>
>> >>
>> >> I think what was actually wrong with the path is that I added the entire
>> >> path to the tesseract executable, which was in my /usr/bin/ directory,
>> >> instead of just the directory where tesseract lives.  Is this true?
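
PATH entries are directories that get searched for the executable name, so
an entry that is itself the path to the executable never matches. A small
sketch of that lookup in Java, with a made-up class name and deliberately
simplified compared to what a shell does (empty entries are skipped here):

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper that mimics a PATH search.
final class PathLookup {

    /** Returns the first "<entry>/<name>" that is an executable regular file. */
    static Optional<Path> find(String name, String pathValue) {
        for (String entry : pathValue.split(File.pathSeparator)) {
            if (entry.isEmpty()) {
                continue; // simplified: a shell treats this as the current directory
            }
            Path candidate = Paths.get(entry).resolve(name);
            if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String path = System.getenv("PATH");
        System.out.println(find("tesseract", path == null ? "" : path));
    }
}
```

With /usr/bin in the list the lookup resolves /usr/bin/tesseract. With only
/usr/bin/tesseract in the list it would look for
/usr/bin/tesseract/tesseract, which does not exist.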
>> >>
>> >>
>> >>
>> >> I deleted the hard coding from TesseractOCRConfig.java and then printed
>> >> config.getTesseractPath() to stdout.  This field was empty.
>> >>
>> >> However, I have tesseract installed system wide on this ubuntu vm.
>> >>
>> >> So the canRun method evaluated as true whether or not the tesseractPath
>> >> was configured correctly.
>> >>
>> >>
>> >>
>> >> I have been slowly trying to debug this all day.  It looks like tika is
>> >> making a tmp file with the .tmp suffix.
>> >>
>> >> I commented out some of the code so that the files remained in /tmp/.
>> >>
>> >>
>> >>
>> >> It looks like tesseract doesn't like that.
>> >>
>> >> I tried to ocr these .tmp files to see if I could isolate what was going
>> >> wrong for me.
>> >>
>> >>
>> >>
>> >> kslote@ubuntu:~/tika/tika$ tesseract /tmp/apache-tika-7112319184053570698.tmp out
>> >>
>> >> Tesseract Open Source OCR Engine
>> >>
>> >> name_to_image_type:Error:Unrecognized image type:/tmp/apache-tika-7112319184053570698.tmp
>> >>
>> >> IMAGE::read_header:Error:Can't read this image type:/tmp/apache-tika-7112319184053570698.tmp
>> >>
>> >> tesseract:Error:Read of file failed:/tmp/apache-tika-7112319184053570698.tmp
>> >>
>> >> Segmentation fault
>> >>
>> >>
>> >>
>> >> On the wiki it mentions something about getting tesseract to work with
>> >> .tiff files.  For whatever reason, the tesseract I have installed only
>> >> works for .tiff files.  Would it be recommended that I reinstall
>> >> tesseract from source?
>> >>
>> >> On Tue, Sep 30, 2014 at 7:28 PM, Ramirez, Paul M (398J) <
>> >> paul.m.ramirez@jpl.nasa.gov> wrote:
>> >>
>> >>> Is that a typo in your path to tesseract?
>> >>>
>> >>> /urs/bin/tesseract => /usr/bin/tesseract
>> >>>
>> >>> --Paul
>> >>>
>> >>>> On Sep 30, 2014, at 1:48 PM, "kevin slote" <kslote1@gmail.com> wrote:
>> >>>>
>> >>>> Unfortunately, that did not do it either.
>> >>>>
>> >>>> I did:
>> >>>>
>> >>>>  $ export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
>> >>>>
>> >>>> Here is the output from printenv
>> >>>>
>> >>>> kslote@ubuntu:~/tika/tika$ printenv
>> >>>> SHELL=/bin/bash
>> >>>> USERNAME=kslote
>> >>>> XDG_CONFIG_DIRS=/etc/xdg/xdg-gnome:/etc/xdg
>> >>>> DESKTOP_SESSION=gnome
>> >>>> PATH=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
>> >>>> PWD=/home/kslote/tika/tika
>> >>>> HOME=/home/kslote
>> >>>> LOGNAME=kslote
>> >>>> _=/usr/bin/printenv
>> >>>>
>> >>>>
>> >>>> On Tue, Sep 30, 2014 at 4:13 PM, Tyler Palsulich <tpalsulich@gmail.com> wrote:
>> >>>>
>> >>>>> Hi,
>> >>>>>
>> >>>>> Hmm. Could you try adding tesseract to your PATH? How did you install
>> >>>>> Tesseract? You should be able to do a straightforward
>> >>>>> `sudo apt-get install tesseract-ocr`. After that, the OCR tests should
>> >>>>> pass. We're still running into TIKA-1422, where a mail test fails. But,
>> >>>>> you can run just the OCR tests with
>> >>>>> `mvn test -Dtest=org.apache.tika.parser.ocr.TesseractOCRTest -DfailIfNoTests=false`.
>> >>>>>
>> >>>>> Let me know if that works for you!
>> >>>>> Tyler
>> >>>>>
>> >>>>>> On Tue, Sep 30, 2014 at 4:00 PM, kevin slote <kslote1@gmail.com> wrote:
>> >>>>>>
>> >>>>>> I am working on Ubuntu 10.04 and I am having some trouble.
>> >>>>>> Tesseract is installed correctly, but just doing a clone from the repo
>> >>>>>> and installing with maven, I am getting some errors.
>> >>>>>>
>> >>>>>> This is before I did anything with tesseract installed.
>> >>>>>>
>> >>>>>> Failed tests:
>> >>>>>> testPPTXOCR(org.apache.tika.parser.ocr.TesseractOCRTest): Check for the image's text.
>> >>>>>> testDOCXOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
>> >>>>>> testPDFOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
>> >>>>>>
>> >>>>>> Next I hard coded the tesseractPath:
>> >>>>>>
>> >>>>>> I went into TesseractOCRConfig.java and hard coded 'tesseractPath'.
>> >>>>>> Then all tests passed and it built successfully, but then I went to
>> >>>>>> post some tiffs to the server.
>> >>>>>> That didn't work. So I tried adding some System.out.println("hello world")
>> >>>>>> (a little crude I know) inside the unit tests to confirm that tesseract
>> >>>>>> was working correctly.  It looks like something happens in the unit test
>> >>>>>> in TesseractOCRTest.java on the line that says
>> >>>>>> TesseractOCRConfig config = new TesseractOCRConfig();. Printing to stdout
>> >>>>>> before works, but I get nothing after. That happens before the
>> >>>>>> assumeTrue(canRun(config));. So an exception is not getting raised.
>> >>>>>>
>> >>>>>> Then once everything is built, ocr does not work.  That was why I
>> >>>>>> figured I would ask to see if I missed some sort of configuration step
>> >>>>>> in building it.
>> >>>>>>
>> >>>>>> Thanks a ton.
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> On Tue, Sep 30, 2014 at 2:57 PM, Mattmann, Chris A (3980) <chris.a.mattmann@jpl.nasa.gov> wrote:
>> >>>>>>
>> >>>>>>> Dear Kevin,
>> >>>>>>>
>> >>>>>>> Sure, it already works :) 1.7-SNAPSHOT.
>> >>>>>>>
>> >>>>>>> See this wiki page:
>> >>>>>>>
>> >>>>>>> https://wiki.apache.org/tika/TikaOCR
>> >>>>>>>
>> >>>>>>> I'd be happy to discuss more.
>> >>>>>>>
>> >>>>>>> Thanks!
>> >>>>>>>
>> >>>>>>> Cheers,
>> >>>>>>> Chris
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> -----Original Message-----
>> >>>>>>> From: kevin slote <kslote1@gmail.com>
>> >>>>>>> Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
>> >>>>>>> Date: Tuesday, September 30, 2014 at 8:52 AM
>> >>>>>>> To: "dev@tika.apache.org" <dev@tika.apache.org>
>> >>>>>>> Subject: OCR with tika-server
>> >>>>>>>
>> >>>>>>>> Hello all,
>> >>>>>>>>
>> >>>>>>>> I have been testing out the integration of tika with tesseract.
>> >>>>>>>> I was wondering if there is a way to get tika-server to run with
>> >>>>>>>> tesseract's OCR capabilities?
>> >>>>>>>>
>> >>>>>>>> Best
>> >>>>>>>>
>> >>>>>>>> Kevin Slote
>> >>>>>
>> >>>
>> >
>>
>>
>
