tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
Date Tue, 28 Oct 2014 01:40:34 GMT

     [ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1445:
    Attachment: TIKA-1445_tallison_v2_20141027.patch

This is more invasive than I'd like, it does not solve all of the problems, and some
important printlns are still in there.

I'm sure this was part of the plan for the integration, but it seems a bit like dark magic
that the AutoDetectParser selects the Tesseract parser for image files only because its
full class name sorts after org.apache.tika.parser.image.ImageParser, etc.

Am I understanding this correctly?  Do we want to take away some of the magic?
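
To make the question concrete, here is a rough, self-contained sketch of the behavior as
I understand it. It is not Tika's code: FakeParser, selectByNameOrder and the media types
each parser claims are made up for illustration, and "sort by class name, later entry wins"
is my assumption about the mechanism.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ParserOrderSketch {

    // Hypothetical stand-in for a parser: a class name plus the media types it claims.
    static final class FakeParser {
        final String className;
        final List<String> mediaTypes;

        FakeParser(String className, String... mediaTypes) {
            this.className = className;
            this.mediaTypes = Arrays.asList(mediaTypes);
        }
    }

    // Assumed mechanism: sort parsers by full class name, then let each later
    // parser overwrite any earlier claim on the same media type.
    static Map<String, String> selectByNameOrder(List<FakeParser> parsers) {
        List<FakeParser> sorted = new ArrayList<>(parsers);
        sorted.sort(Comparator.comparing((FakeParser p) -> p.className));
        Map<String, String> chosen = new TreeMap<>();
        for (FakeParser p : sorted) {
            for (String type : p.mediaTypes) {
                chosen.put(type, p.className);
            }
        }
        return chosen;
    }

    // Illustrative registrations only; the claimed media types are assumptions.
    static List<FakeParser> sampleParsers() {
        return Arrays.asList(
            new FakeParser("org.apache.tika.parser.ocr.TesseractOCRParser", "image/jpeg", "image/png"),
            new FakeParser("org.apache.tika.parser.image.ImageParser", "image/png"),
            new FakeParser("org.apache.tika.parser.jpeg.JpegParser", "image/jpeg"));
    }

    public static void main(String[] args) {
        selectByNameOrder(sampleParsers())
            .forEach((type, cls) -> System.out.println(type + " -> " + cls));
    }
}
```

In this model the OCR parser wins both types only because "ocr" sorts after "image" and
"jpeg"; renaming the package would silently change the outcome.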

I added an AbstractTerminalImageMetadataParser so that we could gather together all classes
used to parse just the metadata of images.  This allows the OCRParsers to go through all the
parsers and pick out only those that are not composite but do parse image metadata.  Perhaps
we should remove the parsers that extend this class from the AutoDetectParser? Still a bit
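
A minimal sketch of the filtering idea, using stand-in types rather than the real Tika
classes or the contents of the patch. The Parser interface, CompositeLikeParser,
MetadataOnlyParser and OtherParser here are hypothetical; only the name
AbstractTerminalImageMetadataParser comes from the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TerminalImageParserSketch {

    // Hypothetical, simplified parser interface (not Tika's Parser).
    interface Parser {
        String name();
    }

    // Shared base type for non-composite parsers that read only image metadata.
    static abstract class AbstractTerminalImageMetadataParser implements Parser {
        private final String name;

        AbstractTerminalImageMetadataParser(String name) {
            this.name = name;
        }

        public String name() {
            return name;
        }
    }

    // Hypothetical concrete metadata parser.
    static final class MetadataOnlyParser extends AbstractTerminalImageMetadataParser {
        MetadataOnlyParser(String name) {
            super(name);
        }
    }

    // Hypothetical composite that only delegates to children.
    static final class CompositeLikeParser implements Parser {
        final List<Parser> children;

        CompositeLikeParser(Parser... children) {
            this.children = Arrays.asList(children);
        }

        public String name() {
            return "CompositeLikeParser";
        }
    }

    // Hypothetical parser unrelated to images.
    static final class OtherParser implements Parser {
        public String name() {
            return "PlainTextParser";
        }
    }

    // Walk all parsers and keep only the terminal image metadata ones.
    static List<String> imageMetadataParserNames(List<Parser> all) {
        List<String> names = new ArrayList<>();
        for (Parser p : all) {
            if (p instanceof AbstractTerminalImageMetadataParser) {
                names.add(p.name());
            }
        }
        return names;
    }

    static List<Parser> sampleParsers() {
        Parser image = new MetadataOnlyParser("ImageParser");
        Parser jpeg = new MetadataOnlyParser("JpegParser");
        return Arrays.asList(new CompositeLikeParser(image, jpeg), image, new OtherParser(), jpeg);
    }

    public static void main(String[] args) {
        System.out.println(imageMetadataParserNames(sampleParsers()));
    }
}
```

An OCR parser could then run each of the selected parsers against the image to collect
metadata before or after doing the OCR itself.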

I think our tests should not add the TesseractOCRParser to the ParseContext as a parser. 
It would be far better to pass in AutoDetectParser so that the TesseractOCRParser operates
on all embedded images, no matter the depth.

This patch is not a solution, only some thoughts.

> Figure out how to add Image metadata extraction to Tesseract parser
> -------------------------------------------------------------------
>                 Key: TIKA-1445
>                 URL: https://issues.apache.org/jira/browse/TIKA-1445
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>             Fix For: 1.8
>         Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch,
> TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch
>
> Now that Tesseract is the default image parser in Tika for many image types, consider
> how to add back in the metadata extraction capabilities by the other Image parsers.

This message was sent by Atlassian JIRA