tika-dev mailing list archives

From "Matthew Caruana Galizia (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-2174) Too few formats in support declared by TesseractOCRParser
Date Wed, 09 Nov 2016 13:13:58 GMT

     [ https://issues.apache.org/jira/browse/TIKA-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthew Caruana Galizia updated TIKA-2174:
------------------------------------------
    Description: 
A complete install of Leptonica with Tesseract will add support for formats that are not declared
by TesseractOCRParser. These include JP2, JPX and PPM.

Tesseract produces OCR output fine for JPX images as of this version:

{noformat}
  $ tesseract -v
     tesseract 3.04.01
       leptonica-1.73
         libjpeg 8d : libpng 1.6.26 : libtiff 4.0.6 : zlib 1.2.5
{noformat}

However, these types are not declared by getSupportedTypes, so no output is produced for PDFs
that contain JPX images of scanned documents, for example.

  was:
Tesseract produces OCR output fine for JPX images as of this version:

{noformat}
  $ tesseract -v
     tesseract 3.04.01
       leptonica-1.73
         libjpeg 8d : libpng 1.6.26 : libtiff 4.0.6 : zlib 1.2.5
{noformat}

However, these types are not declared by getSupportedTypes, so no output is produced for PDFs
that contain JPX images of scanned documents, for example.

        Summary: Too few formats in support declared by TesseractOCRParser  (was: JP2 and JPX (JPEG 2000) support not declared by TesseractOCRParser)

> Too few formats in support declared by TesseractOCRParser
> ---------------------------------------------------------
>
>                 Key: TIKA-2174
>                 URL: https://issues.apache.org/jira/browse/TIKA-2174
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.14
>            Reporter: Matthew Caruana Galizia
>
> A complete install of Leptonica with Tesseract will add support for formats that are
> not declared by TesseractOCRParser. These include JP2, JPX and PPM.
> Tesseract produces OCR output fine for JPX images as of this version:
> {noformat}
>   $ tesseract -v
>      tesseract 3.04.01
>        leptonica-1.73
>          libjpeg 8d : libpng 1.6.26 : libtiff 4.0.6 : zlib 1.2.5
> {noformat}
> However, these types are not declared by getSupportedTypes, so no output is produced for
> PDFs that contain JPX images of scanned documents, for example.
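
To illustrate the gap described above, here is a minimal, self-contained sketch in plain Java. It is not the parser's actual code: the class name, the method names and the baseline list of declared types are all assumptions made for illustration. In Tika itself the declared set holds MediaType objects and is returned by getSupportedTypes(ParseContext). The sketch only shows why an undeclared type such as image/jpx is never handed to the OCR parser, and how adding the three missing types changes that.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in for the parser's declared-type lookup; not Tika code.
public class OcrSupportedTypesSketch {

    // Assumed baseline of declared image types (illustrative, not copied from Tika).
    static final Set<String> DECLARED = Collections.unmodifiableSet(
            new LinkedHashSet<String>(Arrays.asList(
                    "image/png", "image/jpeg", "image/tiff", "image/x-ms-bmp", "image/gif")));

    // The formats named in this issue: JP2, JPX and PPM.
    static final Set<String> MISSING = Collections.unmodifiableSet(
            new LinkedHashSet<String>(Arrays.asList(
                    "image/jp2", "image/jpx", "image/x-portable-pixmap")));

    // Declared set after adding the missing formats.
    static Set<String> extendedTypes() {
        Set<String> all = new LinkedHashSet<String>(DECLARED);
        all.addAll(MISSING);
        return Collections.unmodifiableSet(all);
    }

    // An embedded image is only routed to OCR if its type is declared.
    static boolean isRoutedToOcr(Set<String> declared, String mimeType) {
        return declared.contains(mimeType);
    }

    public static void main(String[] args) {
        for (String type : MISSING) {
            System.out.println(type
                    + " routed before: " + isRoutedToOcr(DECLARED, type)
                    + ", after: " + isRoutedToOcr(extendedTypes(), type));
        }
    }
}
```

Whether a given Tesseract install can actually read these formats still depends on how Leptonica was built, so a real fix may want to keep the extra types configurable rather than hard-coded.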



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
