tika-dev mailing list archives

From kevin slote <kslo...@gmail.com>
Subject Re: OCR with tika-server
Date Mon, 06 Oct 2014 15:48:21 GMT
Ok, I am signed up.

https://wiki.apache.org/tika/Kevin%20Slote

On Fri, Oct 3, 2014 at 11:02 PM, Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> Kevin, glad it is now fixed for you!
>
> If you get a chance, please feel free to document
> this on the wiki:
>
> https://wiki.apache.org/tika/TikaOCR
>
>
> You can sign up for an account, and then I can grant
> you permissions to edit the file. Let me know!
>
> Cheers,
> Chris
>
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattmann@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>
>
> -----Original Message-----
> From: kevin slote <kslote1@gmail.com>
> Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
> Date: Friday, October 3, 2014 at 4:10 PM
> To: "dev@tika.apache.org" <dev@tika.apache.org>
> Subject: Re: OCR with tika-server
>
> >Hi all,
> >
> >I just confirmed that the problem was that my version of tesseract was too
> >old.
> >Maybe it would be a good idea to put something in the canRun method at the
> >top of the tesseract unit test to also check that the installed version of
> >tesseract is recent enough?
> >
> >Older versions of tesseract do not have a "-v" or "--version" flag.  So
> >maybe use ProcessBuilder to run that command and parse the output to see
> >if it returned an error?
> >
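A minimal sketch of the check suggested above, assuming that a sufficiently recent tesseract answers "-v" with a banner line that starts with "tesseract" followed by a version number. The class and method names are hypothetical and not part of Tika, and the banner pattern is an assumption that may need adjusting for a particular build.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not Tika code.
class TesseractVersionCheck {

    /**
     * Runs "<executable> -v" and returns the first line it prints,
     * or null if the process could not be started or printed nothing.
     */
    static String firstLineOfOutput(String executable) {
        ProcessBuilder pb = new ProcessBuilder(executable, "-v");
        // Merge stderr into stdout so the banner is seen on either stream.
        pb.redirectErrorStream(true);
        try {
            Process p = pb.start();
            String first = null;
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                // Drain all output so the child process cannot block on a full pipe.
                while ((line = r.readLine()) != null) {
                    if (first == null) {
                        first = line;
                    }
                }
            }
            p.waitFor();
            return first;
        } catch (IOException e) {
            return null; // executable missing or not runnable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    /** True only if the line looks like "tesseract <digit>..." (assumed banner shape). */
    static boolean looksLikeVersionBanner(String line) {
        return line != null && line.trim().matches("(?i)tesseract\\s+v?\\d.*");
    }

    /** Combines the two steps: does this executable understand "-v"? */
    static boolean hasVersionFlag(String executable) {
        return looksLikeVersionBanner(firstLineOfOutput(executable));
    }
}
```

A canRun-style guard could call hasVersionFlag("tesseract") and skip the OCR tests when it returns false. An old build that treats "-v" as an image name would print its usage or error text instead, which the pattern rejects.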
> >Thanks for everyone's help.
> >
> >On Fri, Oct 3, 2014 at 2:30 PM, kevin slote <kslote1@gmail.com> wrote:
> >
> >> Thanks for following up!
> >>
> >> I was trying to dig deeper before I responded.
> >>
> >> Tyler,
> >>
> >> I followed those instructions.  My version of Tesseract does not ocr the
> >> google logo because it is not a tiff.  I used imagemagick to convert it to
> >> a tif and tesseract returned "check_legal_image_size:Error:Only
> >> 1,2,4,5,6,8 bpp are supported:32" error which usually means it needs to be
> >> re-sized with imagemagick.
> >>
> >>
> >> Chris,
> >>
> >> I wrote a python wrapper for tesseract that can parse the documents that
> >> were in your test-document repository concerning OCR (testOCR.pdf, etc.).
> >> It looks like right now, in TesseractOCRParser.java, the command line
> >> argument that is passed to the os points to a .tmp file in /tmp/.
> >>
> >> So the command that is executed is
> >>
> >>    "tesseract /tmp/apache-tika-2409864150710514587.tmp /tmp/apache-tika-1277985370508249503.tmp -l eng -psm 1"
> >>
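For readers who want to reproduce that invocation outside Tika, here is a small sketch that assembles the same argument list for use with ProcessBuilder. The class, method, and parameter names are hypothetical and this is not the code in TesseractOCRParser.java; the "-l" and "-psm" flags are copied from the command quoted above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Tika code.
class TesseractCommand {

    /**
     * Builds: <executable> <input> <outputBase> -l <language> -psm <pageSegMode>
     * Each argument is a separate list element, so no shell quoting is needed.
     */
    static List<String> build(String executable, String input, String outputBase,
                              String language, int pageSegMode) {
        List<String> cmd = new ArrayList<>();
        cmd.add(executable);
        cmd.add(input);       // image file to recognize
        cmd.add(outputBase);  // output base name passed to tesseract
        cmd.add("-l");
        cmd.add(language);
        cmd.add("-psm");      // flag spelling as it appears in the thread
        cmd.add(Integer.toString(pageSegMode));
        return cmd;
    }
}
```

The resulting list can be handed to new ProcessBuilder(list). Running it by hand against a copy of the temporary file is one way to separate a tesseract problem from a Tika problem.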
> >> This is not working for me.  When I grab those .tmp files and try to ocr
> >> them from the command line, tesseract gets thrown for a loop.
> >>
> >> From what I can tell, the tesseract I have installed can only handle
> >> .tif files.
> >> I can back this up by citing the tesseract page:
> >> https://code.google.com/p/tesseract-ocr/wiki/ReadMe
> >>
> >>  If Tesseract isn't available for your distribution, or you want to use a
> >> newer version than they offer, you can compile your own
> >> <https://code.google.com/p/tesseract-ocr/wiki/Compiling>. Note that older
> >> versions of Tesseract only supported processing .tiff files.
> >>
> >> So, I think that upgrading tesseract or moving to ubuntu 12 or higher will
> >> solve my problems.
> >>
> >> I will let the listserv know if that fixes it.
> >>
> >>
> >> Kevin Slote
> >>
> >>
> >>
> >> On Wed, Oct 1, 2014 at 5:13 PM, Mattmann, Chris A (3980) <
> >> chris.a.mattmann@jpl.nasa.gov> wrote:
> >>
> >>> What type of image is it, Kevin?
> >>>
> >>> If it’s a TIFF, you need to install tesseract with special lib tiff
> >>> parameters. See:
> >>>
> >>> https://gist.github.com/henrik/1967035
> >>>
> >>>
> >>> Can you parse the image file with tesseract by itself, without
> >>> Tika’s tmp image?
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: <Ramirez>, "Paul M   (398J)" <paul.m.ramirez@jpl.nasa.gov>
> >>> Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
> >>> Date: Wednesday, October 1, 2014 at 1:47 PM
> >>> To: "<dev@tika.apache.org>" <dev@tika.apache.org>
> >>> Subject: Re: OCR with tika-server
> >>>
> >>> >Nothing to be embarrassed about at all Kevin. I actually thought maybe it
> >>> >was just a typo issue and I randomly happened to catch that. I've
> >>> >definitely done that one before myself.
> >>> >
> >>> >Bummed that was not the problem.
> >>> >
> >>> >--Paul
> >>> >
> >>> >On Oct 1, 2014, at 1:00 PM, kevin slote <kslote1@gmail.com>
> >>> > wrote:
> >>> >
> >>> >> What I wrote there did have a typo in it. (It's not every day you get to
> >>> >> embarrass yourself in front of a bunch of guys from NASA)
> >>> >>
> >>> >> But that was not what I had in my terminal when I tested it.
> >>> >>
> >>> >>
> >>> >>
> >>> >> The actual PATH was:
> >>> >>
> >>> >>
> >>> >>
> >>> >>
> >>> >>
> >>>
> >>>
> >>> >> "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/bin/tesseract"
> >>> >>
> >>> >>
> >>> >>
> >>> >> I think what was actually wrong with the path is that I added the entire
> >>> >> path to the tesseract executable, which was in my /usr/bin/ directory,
> >>> >> instead of just the directory where tesseract lives.  Is this true?
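On that question: PATH entries are directories that get searched for executables, so an entry that names the executable file itself contributes nothing to the lookup. The sketch below shows one way to spot such entries and to perform the lookup by hand. The class and method names are hypothetical and this is not Tika code.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Tika code.
class PathCheck {

    /** Returns the non-empty PATH entries that are not existing directories. */
    static List<String> nonDirectoryEntries(String pathValue) {
        List<String> bad = new ArrayList<>();
        for (String entry : pathValue.split(File.pathSeparator)) {
            if (entry.isEmpty()) {
                continue; // empty entries are skipped in this sketch
            }
            if (!new File(entry).isDirectory()) {
                bad.add(entry); // e.g. /usr/bin/tesseract, which is a file
            }
        }
        return bad;
    }

    /** Returns the first executable file called name found in a PATH directory, or null. */
    static File findOnPath(String pathValue, String name) {
        for (String entry : pathValue.split(File.pathSeparator)) {
            if (entry.isEmpty()) {
                continue;
            }
            File candidate = new File(entry, name);
            if (candidate.isFile() && candidate.canExecute()) {
                return candidate;
            }
        }
        return null;
    }
}
```

Calling PathCheck.nonDirectoryEntries(System.getenv("PATH")) on the PATH shown above would be expected to report /usr/bin/tesseract, while findOnPath would still locate tesseract through the /usr/bin entry if it is installed there.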
> >>> >>
> >>> >>
> >>> >>
> >>> >> I deleted the hard coding from TesseractOCRConfig.java and then printed
> >>> >> config.getTesseractPath() to stdout.  This field was empty.
> >>> >>
> >>> >> However, I have tesseract installed system wide on this ubuntu vm.
> >>> >>
> >>> >> So the canRun method evaluated as true whether or not the tesseractPath was
> >>> >> configured correctly.
> >>> >>
> >>> >>
> >>> >>
> >>> >> I have been slowly trying to debug this all day.  It looks like tika is
> >>> >> making a tmp file with the .tmp suffix.
> >>> >>
> >>> >> I commented out some of the code so that the files remained in /tmp/.
> >>> >>
> >>> >> It looks like tesseract doesn't like that.
> >>> >>
> >>> >> I tried to ocr these .tmp files to see if I could isolate what was going
> >>> >> wrong for me.
> >>> >>
> >>> >>
> >>> >>
> >>> >> kslote@ubuntu:~/tika/tika$ tesseract /tmp/apache-tika-7112319184053570698.tmp out
> >>> >> Tesseract Open Source OCR Engine
> >>> >> name_to_image_type:Error:Unrecognized image type:/tmp/apache-tika-7112319184053570698.tmp
> >>> >> IMAGE::read_header:Error:Can't read this image type:/tmp/apache-tika-7112319184053570698.tmp
> >>> >> tesseract:Error:Read of file failed:/tmp/apache-tika-7112319184053570698.tmp
> >>> >> Segmentation fault
> >>> >>
> >>> >>
> >>> >>
> >>> >> On the wiki it mentions something about getting tesseract to work with
> >>> >> .tiff files.  For whatever reason, the tesseract I have installed only
> >>> >> works for .tiff files.  Would you recommend that I reinstall tesseract
> >>> >> from source?
> >>> >>
> >>> >> On Tue, Sep 30, 2014 at 7:28 PM, Ramirez, Paul M (398J) <
> >>> >> paul.m.ramirez@jpl.nasa.gov> wrote:
> >>> >>
> >>> >>> Is that a typo in your path to tesseract?
> >>> >>>
> >>> >>> /urs/bin/tesseract => /usr/bin/tesseract
> >>> >>>
> >>> >>> --Paul
> >>> >>>
> >>> >>>> On Sep 30, 2014, at 1:48 PM, "kevin slote" <kslote1@gmail.com> wrote:
> >>> >>>>
> >>> >>>> Unfortunately, that did not do it either.
> >>> >>>>
> >>> >>>> I did:
> >>> >>>>
> >>> >>>>  $ export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
> >>> >>>>
> >>> >>>> Here is the output from printenv
> >>> >>>>
> >>> >>>> kslote@ubuntu:~/tika/tika$ printenv
> >>> >>>> SHELL=/bin/bash
> >>> >>>> USERNAME=kslote
> >>> >>>> XDG_CONFIG_DIRS=/etc/xdg/xdg-gnome:/etc/xdg
> >>> >>>> DESKTOP_SESSION=gnome
> >>> >>>> PATH=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
> >>> >>>> PWD=/home/kslote/tika/tika
> >>> >>>> HOME=/home/kslote
> >>> >>>> LOGNAME=kslote
> >>> >>>> _=/usr/bin/printenv
> >>> >>>>
> >>> >>>>
> >>> >>>> On Tue, Sep 30, 2014 at 4:13 PM, Tyler Palsulich <tpalsulich@gmail.com> wrote:
> >>> >>>>
> >>> >>>>> Hi,
> >>> >>>>>
> >>> >>>>> Hmm. Could you try adding tesseract to your PATH? How did you install
> >>> >>>>> Tesseract? You should be able to do a straightforward `sudo apt-get install
> >>> >>>>> tesseract-ocr`. After that, the OCR tests should pass. We're still running
> >>> >>>>> into TIKA-1422, where a mail test fails. But, you can run just the OCR
> >>> >>>>> tests with `mvn test -Dtest=org.apache.tika.parser.ocr.TesseractOCRTest
> >>> >>>>> -DfailIfNoTests=false`.
> >>> >>>>>
> >>> >>>>> Let me know if that works for you!
> >>> >>>>> Tyler
> >>> >>>>>
> >>> >>>>>> On Tue, Sep 30, 2014 at 4:00 PM, kevin slote <kslote1@gmail.com> wrote:
> >>> >>>>>>
> >>> >>>>>> I am working on ubuntu 10.4. and I am having some trouble.
> >>> >>>>>> Tesseract is installed correctly, but just doing a clone from the repo and
> >>> >>>>>> installing with maven, I am getting some errors.
> >>> >>>>>>
> >>> >>>>>> This is before I did anything with tesseract installed.
> >>> >>>>>>
> >>> >>>>>> Failed tests:
> >>> >>>>>> testPPTXOCR(org.apache.tika.parser.ocr.TesseractOCRTest): Check for the image's text.
> >>> >>>>>> testDOCXOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
> >>> >>>>>> testPDFOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
> >>> >>>>>>
> >>> >>>>>> Next I hard coded the tesseractPath:
> >>> >>>>>>
> >>> >>>>>> I went into TesseractOCRConfig.java and hard coded 'tesseractPath'.
> >>> >>>>>> Then all tests passed and it built successfully, but then I went to post
> >>> >>>>>> some tiffs to the server.
> >>> >>>>>> That didn't work. So I tried adding some System.out.println("hello world")
> >>> >>>>>> (a little crude I know) inside the unit tests to confirm that tesseract
> >>> >>>>>> was working correctly.  It looks like something happens in the unit test in
> >>> >>>>>> TesseractOCRTest.java
> >>> >>>>>> on the line that says TesseractOCRConfig config = new
> >>> >>>>>> TesseractOCRConfig();. Printing to stdout before works, but I get nothing
> >>> >>>>>> after. That happens before the assumeTrue(canRun(config));. So an exception
> >>> >>>>>> is not getting raised.
> >>> >>>>>>
> >>> >>>>>> Then once everything is built, ocr does not work.  That was why I figured I
> >>> >>>>>> would ask to see if I missed some sort of configuration step in building
> >>> >>>>>> it.
> >>> >>>>>>
> >>> >>>>>> Thanks a ton.
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>> On Tue, Sep 30, 2014 at 2:57 PM, Mattmann, Chris A (3980) <
> >>> >>>>>> chris.a.mattmann@jpl.nasa.gov> wrote:
> >>> >>>>>>
> >>> >>>>>>> Dear Kevin,
> >>> >>>>>>>
> >>> >>>>>>> Sure, it already works :) 1.7-SNAPSHOT.
> >>> >>>>>>>
> >>> >>>>>>> See this wiki page:
> >>> >>>>>>>
> >>> >>>>>>> https://wiki.apache.org/tika/TikaOCR
> >>> >>>>>>>
> >>> >>>>>>> I'd be happy to discuss more.
> >>> >>>>>>>
> >>> >>>>>>> Thanks!
> >>> >>>>>>>
> >>> >>>>>>> Cheers,
> >>> >>>>>>> Chris
> >>> >>>>>>>
> >>> >>>>>>>
> >>> >>>>>>>
> >>> >>>>>>>
> >>> >>>>>>>
> >>> >>>>>>>
> >>> >>>>>>>
> >>> >>>>>>>
> >>> >>>>>>> -----Original Message-----
> >>> >>>>>>> From: kevin slote <kslote1@gmail.com>
> >>> >>>>>>> Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
> >>> >>>>>>> Date: Tuesday, September 30, 2014 at 8:52 AM
> >>> >>>>>>> To: "dev@tika.apache.org" <dev@tika.apache.org>
> >>> >>>>>>> Subject: OCR with tika-server
> >>> >>>>>>>
> >>> >>>>>>>> Hello all,
> >>> >>>>>>>>
> >>> >>>>>>>> I have been testing out the integration of tika with tesseract.
> >>> >>>>>>>> I was wondering if there is a way to get tika-server to run with
> >>> >>>>>>>> tesseract's OCR capabilities?
> >>> >>>>>>>>
> >>> >>>>>>>> Best
> >>> >>>>>>>>
> >>> >>>>>>>> Kevin Slote
> >>> >>>>>
> >>> >>>
> >>> >
> >>>
> >>>
> >>
>
>
