tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
Date Tue, 28 Oct 2014 11:08:34 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186696#comment-14186696 ]

Tim Allison edited comment on TIKA-1445 at 10/28/14 11:08 AM:
--------------------------------------------------------------

On further thought...I won't have time to sketch this out until tonight or tomorrow...

It might make sense to get rid of the AbstractTerminalMetadataParser class, and have AbstractOCRParser
load the image metadata parsers from a services file; we could then remove the image metadata
parsers from the Parser services list.  For those without Tesseract installed, the TesseractOCRParser
would be a pass-through to the old behavior (no copying of streams, just classic metadata
parsing); for those with it installed, TesseractOCRParser would copy the stream and do a double
pass, once for the metadata and once for the OCR (as in Tyler's patch).
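
A minimal sketch of the pass-through versus double-pass idea described above, assuming the image bytes are small enough to buffer in memory. MetadataExtractor, OcrEngine, DoublePassSketch and the "ocr.text" key are hypothetical names made up for illustration, not Tika classes, and the lookup of metadata parsers from a services file (for example via java.util.ServiceLoader) is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for an image metadata parser; not a Tika interface. */
interface MetadataExtractor {
    void extract(InputStream in, Map<String, String> metadata) throws IOException;
}

/** Hypothetical stand-in for the OCR step; not a Tika interface. */
interface OcrEngine {
    boolean isAvailable();
    String recognize(InputStream in) throws IOException;
}

final class DoublePassSketch {

    static Map<String, String> parse(InputStream in,
                                     MetadataExtractor extractor,
                                     OcrEngine ocr) throws IOException {
        Map<String, String> metadata = new LinkedHashMap<>();

        if (!ocr.isAvailable()) {
            // Pass-through: no stream copy, just the classic metadata parse.
            extractor.extract(in, metadata);
            return metadata;
        }

        // OCR available: copy the stream once, then read the copy twice.
        byte[] bytes = readAll(in);
        extractor.extract(new ByteArrayInputStream(bytes), metadata); // pass 1: metadata
        String text = ocr.recognize(new ByteArrayInputStream(bytes)); // pass 2: OCR
        metadata.put("ocr.text", text); // assumed key, for illustration only
        return metadata;
    }

    // Buffers the whole stream in memory (an assumption of this sketch).
    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }
}
```

Buffering to a byte array keeps the sketch short; for large images a real implementation would more likely spool to a temporary file.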

This solution would get us out of the reliance on reverse alphabetic sort order of parser
class names to pick the oat.parser.ocr.TesseractOCRParser as "best" parser for .gif, .jpeg,
etc.  Of course, we're still relying on that order to pick TesseractOCRParser over GDAL for
.png files...
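
To make the sort-order concern concrete, here is a toy illustration, not Tika's actual selection code: if candidate parsers are ordered by class name in reverse alphabetical order and the first one that claims a media type wins, the winner depends only on how the classes happen to be named. SortOrderSelectionSketch and pick are hypothetical, and any fully qualified class names passed in are the caller's assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class SortOrderSelectionSketch {

    /**
     * Returns the first class name, in reverse alphabetical order,
     * whose declared media types include the requested one, or null.
     */
    static String pick(Map<String, Set<String>> typesByClassName, String mediaType) {
        List<String> names = new ArrayList<>(typesByClassName.keySet());
        names.sort(Collections.reverseOrder()); // reverse alphabetical by class name
        for (String name : names) {
            if (typesByClassName.get(name).contains(mediaType)) {
                return name; // "best" here only means "sorts first"
            }
        }
        return null;
    }
}
```

Under this toy rule a name containing ".ocr." sorts ahead of one containing ".gdal." or ".jpeg.", so renaming or moving a class would silently change which parser handles a type.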


was (Author: tallison@mitre.org):
On further thought...I won't have time to sketch this out until tonight or tomorrow...

It might make sense to get rid of the AbstractTerminalMetadataParser class, and have AbstractOCRParser
load the image metadata parsers from a services file; we could then remove the image metadata
parsers from the Parser services list.  For those without Tesseract installed, the TesseractOCRParser
would be a pass-through to the old behavior (no copying of streams, just classic metadata
parsing); for those with it installed, TesseractOCRParser would copy the stream and do a double
pass, once for the metadata and once for the OCR (as in Tyler's patch).

This solution would get us out of the reliance on alphabetic sort order of parser class names
to pick the oat.ocr.TesseractOCRParser as "best" parser for .gif, .jpeg, etc.  Of course,
we're still relying on that order to pick TesseractOCRParser over GDAL for .png files...

> Figure out how to add Image metadata extraction to Tesseract parser
> -------------------------------------------------------------------
>
>                 Key: TIKA-1445
>                 URL: https://issues.apache.org/jira/browse/TIKA-1445
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>             Fix For: 1.8
>
>         Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch,
>                      TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, consider
> how to add back in the metadata extraction capabilities by the other Image parsers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
