tika-dev mailing list archives

From Jörg Ehrlich (JIRA) <j...@apache.org>
Subject [jira] [Commented] (TIKA-930) Consolidation of Some Tika Core Properties
Date Mon, 02 Jul 2012 09:12:44 GMT

    [ https://issues.apache.org/jira/browse/TIKA-930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13404953#comment-13404953 ]

Jörg Ehrlich commented on TIKA-930:

Hi Ray and Nick,

It is very important to also "educate" average developers to use the standards properly.
As I wrote for the Rating field: it is imperative to stick with standards; otherwise you
risk sacrificing interoperability, which is one of the most important features of metadata.
And regarding the Creator field: with IPTC and PLUS there are very strong and well-known
standards for describing who created which part of an asset. I strongly recommend sticking
with at least one of them instead of coming up with a proprietary creator scheme of our own
that no one knows about.
It's nice to be pragmatic, but not using standards for metadata today causes a lot of
headaches in the future.

Regarding Geo data: I'm ok with using the W3C properties for the core properties.
> Consolidation of Some Tika Core Properties
> ------------------------------------------
>                 Key: TIKA-930
>                 URL: https://issues.apache.org/jira/browse/TIKA-930
>             Project: Tika
>          Issue Type: Improvement
>          Components: metadata
>    Affects Versions: 1.2
>            Reporter: Ray Gauss II
> There are a few properties in TikaCoreProperties which overlap and I think we should
> minimize ambiguity by consolidating them into a single composite property with the clearest
> name, the most general specification referenced as its primary property, and the others and
> deprecated strings as its secondaries.
> Here's the proposed pseudo-code for the changes:
> Remove TikaCoreProperties.SUBJECT
> TikaCoreProperties.KEYWORDS <- DublinCore.SUBJECT, { Office.KEYWORDS, MSOffice.KEYWORDS, Metadata.SUBJECT }
> Remove TikaCoreProperties.DATE
> TikaCoreProperties.CREATION_DATE <- DublinCore.DATE, { Office.CREATION_DATE, MSOffice.CREATION_DATE, Metadata.DATE }
> Remove TikaCoreProperties.MODIFIED
> TikaCoreProperties.SAVE_DATE <- DublinCore.MODIFIED, { Office.SAVE_DATE, MSOffice.LAST_SAVED, Metadata.MODIFIED, "Last-Modified" }
> and an example of the Java changes:
> {code:title=TikaCoreProperties.java *Before*}
>     /**
>      * @see DublinCore#SUBJECT
>      */
>     public static final Property SUBJECT = Property.composite(DublinCore.SUBJECT, 
>             new Property[] { Property.internalText(Metadata.SUBJECT) });
>     /**
>      * @see Office#KEYWORDS
>      */
>     public static final Property KEYWORDS = Property.composite(Office.KEYWORDS,
>             new Property[] { Property.internalTextBag(MSOffice.KEYWORDS) });
> {code}
> would become
> {code:title=TikaCoreProperties.java *After*}
>     /**
>      * @see DublinCore#SUBJECT
>      * @see Office#KEYWORDS
>      */
>     public static final Property KEYWORDS = Property.composite(DublinCore.SUBJECT,
>             new Property[] {
>                 Office.KEYWORDS,
>                 Property.internalTextBag(MSOffice.KEYWORDS),
>                 Property.internalText(Metadata.SUBJECT)
>             });
> {code}
> Since this would require a bit of refactoring for parsers that use the properties being
> removed, I thought it best to get some feedback before working up a full patch.
> Does this seem like a reasonable approach?
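
For readers unfamiliar with the idea, here is a minimal, library-free sketch of what consolidating overlapping names into one composite key could look like. It is only a simplified model and not Tika's implementation: the `CompositeKey` class, the `set` helper, and the key strings are hypothetical, and it assumes that setting a composite key writes the value under the primary name and under every secondary name.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CompositePropertySketch {

    /** Hypothetical stand-in for a composite property: one primary name plus secondaries. */
    static final class CompositeKey {
        final String primary;
        final List<String> secondaries;

        CompositeKey(String primary, String... secondaries) {
            this.primary = primary;
            this.secondaries = Collections.unmodifiableList(Arrays.asList(secondaries));
        }
    }

    // Placeholder names that mirror the identifiers in the proposal; they are
    // NOT the real metadata key strings used by Tika.
    static final CompositeKey KEYWORDS = new CompositeKey(
            "DublinCore.SUBJECT",
            "Office.KEYWORDS", "MSOffice.KEYWORDS", "Metadata.SUBJECT");

    /** Assumed behaviour: write the value under the primary name and every secondary name. */
    static void set(Map<String, String> metadata, CompositeKey key, String value) {
        metadata.put(key.primary, value);
        for (String name : key.secondaries) {
            metadata.put(name, value);
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        set(metadata, KEYWORDS, "metadata, standards");
        // A reader using the new primary name and one using a deprecated name
        // both find the same value.
        System.out.println(metadata.get("DublinCore.SUBJECT"));
        System.out.println(metadata.get("MSOffice.KEYWORDS"));
    }
}
```

Under that assumption, a parser would set one consolidated key while consumers that still read a deprecated name keep working, which is the backwards-compatibility argument for listing the old names as secondaries.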


