tika-dev mailing list archives

From kevin slote <kslo...@gmail.com>
Subject Re: OCR with tika-server
Date Fri, 03 Oct 2014 18:30:53 GMT
Thanks for following up!

I was trying to dig deeper before I responded.

Tyler,

I followed those instructions.  My version of Tesseract does not OCR the
Google logo because it is not a TIFF.  I used ImageMagick to convert it to
a TIFF, and tesseract returned the error
"check_legal_image_size:Error:Only 1,2,4,5,6,8 bpp are supported:32",
which usually means the image needs to be re-converted with ImageMagick at
a supported bit depth.
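
A minimal sketch of that conversion step, assuming ImageMagick's `convert`
is on the PATH.  The helper names and file names are hypothetical, and
whether the output satisfies a particular Tesseract build is untested here.

```python
import subprocess


def build_convert_command(src, dst):
    """Build an ImageMagick command that writes an 8-bit grayscale TIFF.

    Dropping the alpha channel and converting to grayscale is intended to
    bring a 32 bpp RGBA image down to 8 bpp, one of the depths listed in
    the check_legal_image_size error message.
    """
    return [
        "convert", src,
        "-alpha", "off",         # discard the alpha channel
        "-colorspace", "Gray",   # collapse RGB to a single channel
        "-depth", "8",           # 8 bits per channel
        dst,
    ]


def convert_for_tesseract(src, dst):
    # Runs ImageMagick; raises CalledProcessError if convert fails.
    subprocess.run(build_convert_command(src, dst), check=True)
```

Calling convert_for_tesseract("logo.png", "logo.tif") would write the TIFF
next to the source image; building the command separately makes it easy to
print and run by hand first.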


Chris,

I wrote a Python wrapper for tesseract that can parse the documents that
were in your test-document repository concerning OCR (testOCR.pdf, etc.).
It looks like right now, in TesseractOCRParser.java, the command-line
argument that is passed to the OS points to a .tmp file in /tmp/.

So the command that is executed is

   "tesseract /tmp/apache-tika-2409864150710514587.tmp /tmp/apache-tika-1277985370508249503.tmp -l eng -psm 1"

This is not working for me.  When I grab those .tmp files and try to OCR
them from the command line, tesseract fails to read them.
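
The "Unrecognized image type" error quoted further down suggests that this
Tesseract build chooses its image reader from the file name.  One way to
check that outside Tika is to copy the temp file to a name whose extension
matches its actual contents.  A minimal sketch; the helper names are
hypothetical, and only TIFF, PNG, and JPEG signatures are checked.

```python
import shutil

# Leading bytes ("magic numbers") of a few common image formats.
SIGNATURES = [
    (b"II*\x00", ".tif"),             # TIFF, little-endian
    (b"MM\x00*", ".tif"),             # TIFF, big-endian
    (b"\x89PNG\r\n\x1a\n", ".png"),   # PNG
    (b"\xff\xd8\xff", ".jpg"),        # JPEG
]


def guess_extension(header):
    """Return an extension for the given leading bytes, or None if unknown."""
    for magic, ext in SIGNATURES:
        if header.startswith(magic):
            return ext
    return None


def copy_with_extension(tmp_path):
    """Copy tmp_path to a sibling file whose name ends in the guessed extension."""
    with open(tmp_path, "rb") as f:
        ext = guess_extension(f.read(8))
    if ext is None:
        raise ValueError("unrecognized image signature: " + tmp_path)
    new_path = tmp_path + ext
    shutil.copyfile(tmp_path, new_path)
    return new_path
```

Running tesseract by hand on the returned path would then show whether the
file name alone is the problem.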

From what I can tell, the tesseract I have installed can only handle
.tif files.
I can back this up by citing the tesseract page:
https://code.google.com/p/tesseract-ocr/wiki/ReadMe

If Tesseract isn't available for your distribution, or you want to use a
newer version than they offer, you can compile your own
<https://code.google.com/p/tesseract-ocr/wiki/Compiling>. Note that older
versions of Tesseract only supported processing .tiff files.

So, I think that upgrading tesseract or moving to Ubuntu 12 or higher will
solve my problems.

I will let the listserv know if that fixes it.


Kevin Slote



On Wed, Oct 1, 2014 at 5:13 PM, Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> What type of image is it, Kevin?
>
> If it’s a TIFF, you need to install tesseract with special libtiff
> parameters. See:
>
> https://gist.github.com/henrik/1967035
>
>
> Can you parse the image file with tesseract by itself, without
> Tika’s tmp image?
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattmann@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>
>
> -----Original Message-----
> From: <Ramirez>, "Paul M   (398J)" <paul.m.ramirez@jpl.nasa.gov>
> Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
> Date: Wednesday, October 1, 2014 at 1:47 PM
> To: "<dev@tika.apache.org>" <dev@tika.apache.org>
> Subject: Re: OCR with tika-server
>
> >Nothing to be embarrassed about at all, Kevin. I actually thought maybe it
> >was just a typo issue and I randomly happened to catch that. I've
> >definitely done that one before myself.
> >
> >Bummed that was not the problem.
> >
> >--Paul
> >
> >On Oct 1, 2014, at 1:00 PM, kevin slote <kslote1@gmail.com>
> > wrote:
> >
> >> What I wrote there did have a typo in it. (It's not every day you get to
> >> embarrass yourself in front of a bunch of guys from NASA)
> >>
> >> But that was not what I had in my terminal when I tested it.
> >>
> >>
> >>
> >> The actual PATH was:
> >>
> >> "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/bin/tesseract"
> >>
> >>
> >>
> >> I think what was actually wrong with the path is that I added the entire
> >> path to the tesseract executable, which was in my /usr/bin/ directory,
> >> instead of just the directory where tesseract lives.  Is this true?
> >>
> >>
> >>
> >> I deleted the hard coding from TesseractOCRConfig.java and then printed
> >> config.getTesseractPath() to stdout.  This field was empty.
> >>
> >> However, I have tesseract installed system wide on this ubuntu vm.
> >>
> >> So the canRun method evaluated as true whether or not the tesseractPath
> >> was configured correctly.
> >>
> >>
> >>
> >> I have been slowly trying to debug this all day.  It looks like tika is
> >> making a temp file with the .tmp suffix.
> >>
> >> I commented out some of the code so that the files remained in /tmp/.
> >>
> >>
> >>
> >> It looks like tesseract doesn't like that.
> >>
> >> I tried to ocr these .tmp files to see if I could isolate what was going
> >> wrong for me.
> >>
> >>
> >>
> >> kslote@ubuntu:~/tika/tika$ tesseract /tmp/apache-tika-7112319184053570698.tmp out
> >> Tesseract Open Source OCR Engine
> >> name_to_image_type:Error:Unrecognized image type:/tmp/apache-tika-7112319184053570698.tmp
> >> IMAGE::read_header:Error:Can't read this image type:/tmp/apache-tika-7112319184053570698.tmp
> >> tesseract:Error:Read of file failed:/tmp/apache-tika-7112319184053570698.tmp
> >> Segmentation fault
> >>
> >>
> >>
> >> On the wiki it mentions something about getting tesseract to work with
> >> .tiff files.  For whatever reason, the tesseract I have installed only
> >> works for .tiff files.  Would you recommend that I reinstall tesseract
> >> from source?
> >>
> >> On Tue, Sep 30, 2014 at 7:28 PM, Ramirez, Paul M (398J) <
> >> paul.m.ramirez@jpl.nasa.gov> wrote:
> >>
> >>> Is that a typo in your path to tesseract?
> >>>
> >>> /urs/bin/tesseract => /usr/bin/tesseract
> >>>
> >>> --Paul
> >>>
> >>>> On Sep 30, 2014, at 1:48 PM, "kevin slote" <kslote1@gmail.com> wrote:
> >>>>
> >>>> Unfortunately, that did not do it either.
> >>>>
> >>>> I did:
> >>>>
> >>>>  $ export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
> >>>>
> >>>> Here is the output from printenv
> >>>>
> >>>> kslote@ubuntu:~/tika/tika$ printenv
> >>>> SHELL=/bin/bash
> >>>> USERNAME=kslote
> >>>> XDG_CONFIG_DIRS=/etc/xdg/xdg-gnome:/etc/xdg
> >>>> DESKTOP_SESSION=gnome
> >>>> PATH=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
> >>>> PWD=/home/kslote/tika/tika
> >>>> HOME=/home/kslote
> >>>> LOGNAME=kslote
> >>>> _=/usr/bin/printenv
> >>>>
> >>>>
> >>>> On Tue, Sep 30, 2014 at 4:13 PM, Tyler Palsulich <tpalsulich@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> Hmm. Could you try adding tesseract to your PATH? How did you install
> >>>>> Tesseract? You should be able to do a straightforward
> >>>>> `sudo apt-get install tesseract-ocr`. After that, the OCR tests should
> >>>>> pass. We're still running into TIKA-1422, where a mail test fails. But
> >>>>> you can run just the OCR tests with
> >>>>> `mvn test -Dtest=org.apache.tika.parser.ocr.TesseractOCRTest -DfailIfNoTests=false`.
> >>>>>
> >>>>> Let me know if that works for you!
> >>>>> Tyler
> >>>>>
> >>>>>> On Tue, Sep 30, 2014 at 4:00 PM, kevin slote <kslote1@gmail.com> wrote:
> >>>>>>
> >>>>>> I am working on Ubuntu 10.04 and I am having some trouble.
> >>>>>> Tesseract is installed correctly, but after just cloning the repo and
> >>>>>> installing with maven, I am getting some errors.
> >>>>>>
> >>>>>> This is before I did anything with tesseract installed.
> >>>>>>
> >>>>>> Failed tests:
> >>>>>> testPPTXOCR(org.apache.tika.parser.ocr.TesseractOCRTest):
> >>>>>> Check for the image's text.
> >>>>>> testDOCXOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
> >>>>>> testPDFOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
> >>>>>>
> >>>>>> Next I hard coded the tesseractPath:
> >>>>>>
> >>>>>> I went into TesseractOCRConfig.java and hard coded 'tesseractPath'.
> >>>>>> Then all tests passed and it built successfully, but then I went to
> >>>>>> post some TIFFs to the server.
> >>>>>> That didn't work. So I tried adding some
> >>>>>> System.out.println("hello world") calls (a little crude, I know)
> >>>>>> inside the unit tests to confirm that tesseract was working
> >>>>>> correctly.  It looks like something happens in the unit test in
> >>>>>> TesseractOCRTest.java on the line that says
> >>>>>> TesseractOCRConfig config = new TesseractOCRConfig();.
> >>>>>> Printing to stdout before that line works, but I get nothing after
> >>>>>> it. That happens before the assumeTrue(canRun(config));. So an
> >>>>>> exception is not getting raised.
> >>>>>>
> >>>>>> Then once everything is built, OCR does not work.  That was why I
> >>>>>> figured I would ask to see if I missed some sort of configuration
> >>>>>> step in building it.
> >>>>>>
> >>>>>> Thanks a ton.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Sep 30, 2014 at 2:57 PM, Mattmann, Chris A (3980) <
> >>>>>> chris.a.mattmann@jpl.nasa.gov> wrote:
> >>>>>>
> >>>>>>> Dear Kevin,
> >>>>>>>
> >>>>>>> Sure, it already works :) 1.7-SNAPSHOT.
> >>>>>>>
> >>>>>>> See this wiki page:
> >>>>>>>
> >>>>>>> https://wiki.apache.org/tika/TikaOCR
> >>>>>>>
> >>>>>>> I'd be happy to discuss more.
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Chris
> >>>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: kevin slote <kslote1@gmail.com>
> >>>>>>> Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
> >>>>>>> Date: Tuesday, September 30, 2014 at 8:52 AM
> >>>>>>> To: "dev@tika.apache.org" <dev@tika.apache.org>
> >>>>>>> Subject: OCR with tika-server
> >>>>>>>
> >>>>>>>> Hello all,
> >>>>>>>>
> >>>>>>>> I have been testing out the integration of tika with tesseract.
> >>>>>>>> I was wondering if there is a way to get tika-server to run with
> >>>>>>>> tesseract's OCR capabilities?
> >>>>>>>>
> >>>>>>>> Best
> >>>>>>>>
> >>>>>>>> Kevin Slote
> >>>>>
> >>>
> >
>
>
