tika-dev mailing list archives

From Tyler Palsulich <tpalsul...@gmail.com>
Subject Re: [jira] [Commented] (TIKA-93) OCR support
Date Mon, 02 Jun 2014 15:17:24 GMT
Hi,

> Tesseract is by itself a project that is written in C/C++ and should be
> compiled differently for each platform.
Good point! We should figure out a way to fail gracefully when Tesseract
isn't installed, right? Unless there is, in fact, some pure Java OCR
implementation.
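
A minimal sketch of one way such a check could look, assuming the Tesseract binary is named `tesseract` and is found via the PATH. The class name `TesseractProbe`, the `--version` argument, and the 5 second limit are illustrative assumptions, not part of the TIKA-93 patch. The probe only verifies that the process can be started and exits in time; it does not inspect the exit code or the output.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Tika: probes for an external OCR binary.
public class TesseractProbe {

    /**
     * Returns true if the given command can be started and exits within
     * 5 seconds. A missing or non-executable binary yields false instead
     * of an exception, so callers can skip OCR quietly.
     */
    public static boolean isAvailable(String command) {
        // "--version" is an assumption; any argument that makes the tool
        // exit quickly would do, since the exit code is ignored here.
        ProcessBuilder pb = new ProcessBuilder(command, "--version");
        pb.redirectErrorStream(true);
        // Discard output so the child never blocks on a full pipe (Java 9+).
        pb.redirectOutput(ProcessBuilder.Redirect.DISCARD);
        try {
            Process p = pb.start();
            if (!p.waitFor(5, TimeUnit.SECONDS)) {
                p.destroyForcibly(); // hung process: treat as unavailable
                return false;
            }
            return true;
        } catch (IOException e) {
            return false; // binary not found or not executable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
            return false;
        }
    }

    public static void main(String[] args) {
        if (isAvailable("tesseract")) {
            System.out.println("Tesseract found; OCR can be enabled.");
        } else {
            System.out.println("Tesseract not found; skipping OCR.");
        }
    }
}
```

A parser could call a check like this once and fall back to plain image handling when it returns false, rather than failing the whole parse.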

Another thought: we should add OCR as a command line option -- one option
for extracting images, one for running OCR (which always enables image
extraction).

Tyler


On Thu, May 29, 2014 at 1:26 PM, Oleg Tikhonov <oleg@apache.org> wrote:

> Guys,
> Tesseract is by itself a project that is written in C/C++ and should be
> compiled differently for each platform.
> Personally, I would make Tesseract a requirement for those who want to work
> with it. I am not sure that putting Tesseract in the sources is the right
> way to go.
>
> >> How good Tesseract is depends at least on the trained data and the
> quality of the input images. No simple answer exists.
>
> BR,
> Oleg
>
>
> On Thu, May 29, 2014 at 11:07 PM, Luis Filipe Nassif (JIRA)
> <jira@apache.org> wrote:
>
> >
> >     [ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012810#comment-14012810 ]
> >
> > Luis Filipe Nassif commented on TIKA-93:
> > ----------------------------------------
> >
> > Thank you very much [~tpalsulich] for including unit tests! We could also
> > include tests for normal images (not embedded).
> >
> > There is a simple timeout control that throws a TikaException with a
> > specific message if the timeout is reached. The idea of requiring a
> > TesseractOCRConfig object in the parseContext to run OCR is to avoid
> > affecting users who do not want OCR, precisely because it can take
> > seconds or even minutes. So TesseractOCRParser can be included in Tika's
> > parser list by default with no problem. We could also include a warning
> > about OCR slowness in the class description.
> >
> > I have no idea how to include Tesseract in the sources. Maybe Tika
> > committers can help with this?
> >
> > > OCR support
> > > -----------
> > >
> > >                 Key: TIKA-93
> > >                 URL: https://issues.apache.org/jira/browse/TIKA-93
> > >             Project: Tika
> > >          Issue Type: New Feature
> > >          Components: parser
> > >            Reporter: Jukka Zitting
> > >            Assignee: Chris A. Mattmann
> > >            Priority: Minor
> > >             Fix For: 1.6
> > >
> > >         Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch,
> > > TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch,
> > > TesseractOCR_Tyler.patch, testOCR.docx, testOCR.pdf, testOCR.pptx
> > >
> > >
> > > I don't know of any decent open source pure Java OCR libraries, but
> > > there are command line OCR tools like Tesseract
> > > (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika
> > > to extract text content (where available) from image files.
> >
> >
> >
> > --
> > This message was sent by Atlassian JIRA
> > (v6.2#6252)
> >
>
