tika-dev mailing list archives

From "Nick Burch (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-482) Refactor image and jpeg parsers for access to MetadataExtractor API
Date Mon, 06 Sep 2010 11:31:33 GMT

    [ https://issues.apache.org/jira/browse/TIKA-482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906483#action_12906483 ]

Nick Burch commented on TIKA-482:

I couldn't include ImageMetadataExtractorTest as it uses new features of the extractor that
weren't in the patch...

Looking at your latest git patch:
* I think we do need to keep all the random metadata as-is, since that is all there has been
for a while, and anyone currently using Tika will be relying on it
* Could ExifOldStyleHandler and ExifHandler be merged? I guess ExifOldStyleHandler would need
to be switched from the tag iterator to directory.containsTag first, though?
* For the keywords, would it not be better to use Tika's multiple-value metadata support,
rather than the underscore stuff?
* What else do you think is needed before we could apply this?
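
To illustrate the multiple-value suggestion for keywords, here is a rough sketch. It does not use Tika itself: MultiValueMetadata is a hypothetical stand-in that only mimics the add/getValues style of a multi-valued store, and the "Keywords" key name is an assumption rather than something taken from the patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class KeywordSketch {

    // Hypothetical stand-in for a multi-valued metadata store.
    // This is NOT Tika's Metadata class, only an illustration of the idea.
    public static final class MultiValueMetadata {
        private final Map<String, List<String>> values = new LinkedHashMap<>();

        // Append a value under the name instead of overwriting the previous one.
        public void add(String name, String value) {
            values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }

        // Return all values stored under the name, or an empty list if none.
        public List<String> getValues(String name) {
            return values.getOrDefault(name, new ArrayList<>());
        }
    }

    // Store each keyword as its own value under one key ("Keywords" is assumed).
    public static MultiValueMetadata storeKeywords(String[] keywords) {
        MultiValueMetadata metadata = new MultiValueMetadata();
        for (String keyword : keywords) {
            metadata.add("Keywords", keyword);
        }
        return metadata;
    }

    public static void main(String[] args) {
        MultiValueMetadata metadata = storeKeywords(new String[] {"new york", "harbour"});
        System.out.println(metadata.getValues("Keywords"));
    }
}
```

Keeping each keyword as a separate value means a keyword containing spaces or punctuation never has to be encoded into, or parsed back out of, a single string.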

On the date thing, maybe the right thing to do is:
* EXIF original date -> Metadata.DATE, Metadata.CREATION_DATE
* EXIF date -> Metadata.LAST_MODIFIED
Would that make more sense to you?
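
The proposed mapping could be sketched roughly as below. This is only an illustration under stated assumptions: a plain Map stands in for the Tika metadata object, the three key strings are placeholders for Metadata.DATE, Metadata.CREATION_DATE and Metadata.LAST_MODIFIED, and skipping a missing (null) date is my own assumption, not something discussed above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ExifDateMappingSketch {

    // Placeholder key names standing in for Metadata.DATE, Metadata.CREATION_DATE
    // and Metadata.LAST_MODIFIED; the real constants may use different strings.
    public static final String DATE = "date";
    public static final String CREATION_DATE = "creation-date";
    public static final String LAST_MODIFIED = "last-modified";

    // exifOriginalDate: the EXIF "original" date; exifDate: the plain EXIF date.
    // Either may be null when the image does not carry that field (assumed behaviour).
    public static Map<String, String> mapDates(String exifOriginalDate, String exifDate) {
        Map<String, String> metadata = new LinkedHashMap<>();
        if (exifOriginalDate != null) {
            // EXIF original date -> DATE and CREATION_DATE
            metadata.put(DATE, exifOriginalDate);
            metadata.put(CREATION_DATE, exifOriginalDate);
        }
        if (exifDate != null) {
            // EXIF date -> LAST_MODIFIED
            metadata.put(LAST_MODIFIED, exifDate);
        }
        return metadata;
    }

    public static void main(String[] args) {
        System.out.println(mapDates("2010-09-01T10:00:00", "2010-09-06T11:31:33"));
    }
}
```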

> Refactor image and jpeg parsers for access to MetadataExtractor API
> -------------------------------------------------------------------
>                 Key: TIKA-482
>                 URL: https://issues.apache.org/jira/browse/TIKA-482
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Staffan Olsson
>         Attachments: TIKA-451-DublinCore_and_TIKA-482.patch
> When I added support for more image metadata in TIKA-472, I realized
> the current design had some restrictions:
>  * I could not access the typed getters from Metadata Extractor, such
> as getDate (to format iso date) and getStringArray (for keywords).
>  * The handler function was called one field at a time, which prevents
> logic where one field depends on the value of another (there are, for
> example, record versions and fields that specify encoding)
> See attached patch. It refactors TiffExtractor to MetadataExtractorExtractor.
> The patch also includes the date fix, see https://issues.apache.org/jira/browse/TIKA-451#action_12898794
> We can later add more Extractors using other libraries, and map to parsers based on format.
> For example we already use ImageIO in ImageParser so maybe there should be an ImageIOExtractor.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
