tika-dev mailing list archives

From Jörg Ehrlich (JIRA) <j...@apache.org>
Subject [jira] [Commented] (TIKA-930) Consolidation of Some Tika Core Properties
Date Tue, 29 May 2012 16:53:24 GMT

    [ https://issues.apache.org/jira/browse/TIKA-930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284925#comment-13284925 ]

Jörg Ehrlich commented on TIKA-930:
-----------------------------------

Some answers to Ray's comments:

Creator:
The DublinCore creator is usually considered the creator of the intellectual property, not
the creator of the file; the latter is what the "creator tool" property is for. So we should stick
with the "creator" property and not use "author" or any other additional key.

Rating:
I think we should not use anything more generic here. The generic approaches taken
in the past are the reason we have this huge mess of incompatible applications today.
There is a strong reason why the Metadata Working Group defined the rating the way
it did, and a lot of important applications understand and use this definition today. And didn't
we say we wanted to use only properties that are clearly defined?

Geographic:
Have you found any files or file types that actually use the W3C approach to store
geolocation data? All the ones I have seen so far use Exif to store it :)


                
> Consolidation of Some Tika Core Properties
> ------------------------------------------
>
>                 Key: TIKA-930
>                 URL: https://issues.apache.org/jira/browse/TIKA-930
>             Project: Tika
>          Issue Type: Improvement
>          Components: metadata
>    Affects Versions: 1.2
>            Reporter: Ray Gauss II
>
> There are a few properties in TikaCoreProperties which overlap and I think we should
> minimize ambiguity by consolidating them into a single composite property with the clearest
> name, the most general specification referenced as its primary property, and the others and
> deprecated strings as its secondaries.
> Here's the proposed pseudo-code for the changes:
> Remove TikaCoreProperties.SUBJECT
> TikaCoreProperties.KEYWORDS <- DublinCore.SUBJECT, { Office.KEYWORDS, MSOffice.KEYWORDS, Metadata.SUBJECT }
> Remove TikaCoreProperties.DATE
> TikaCoreProperties.CREATION_DATE <- DublinCore.DATE, { Office.CREATION_DATE, MSOffice.CREATION_DATE, Metadata.DATE }
> Remove TikaCoreProperties.MODIFIED
> TikaCoreProperties.SAVE_DATE <- DublinCore.MODIFIED, { Office.SAVE_DATE, MSOffice.LAST_SAVED, Metadata.MODIFIED, "Last-Modified" }
> and an example of the Java changes:
> {code:title=TikaCoreProperties.java *Before*}
>     /**
>      * @see DublinCore#SUBJECT
>      */
>     public static final Property SUBJECT = Property.composite(DublinCore.SUBJECT, 
>             new Property[] { Property.internalText(Metadata.SUBJECT) });
>       
>     /**
>      * @see Office#KEYWORDS
>      */
>     public static final Property KEYWORDS = Property.composite(Office.KEYWORDS,
>             new Property[] { Property.internalTextBag(MSOffice.KEYWORDS) });
> {code}
> would become
> {code:title=TikaCoreProperties.java *After*}
>     /**
>      * @see DublinCore#SUBJECT
>      * @see Office#KEYWORDS
>      */
>     public static final Property KEYWORDS = Property.composite(DublinCore.SUBJECT,
>             new Property[] {
>                 Office.KEYWORDS,
>                 Property.internalTextBag(MSOffice.KEYWORDS),
>                 Property.internalText(Metadata.SUBJECT)
>             });
> {code}
> Since this would require a bit of refactoring for parsers that use the properties being
> removed I thought it best to get some feedback before working up a full patch.
> Does this seem like a reasonable approach?
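
To make the intent of the quoted proposal concrete, here is a minimal, self-contained sketch of how a composite key with one primary and several secondaries could behave: a value written through the composite key also lands under each secondary (deprecated) key, so older readers keep working. The classes CompositeKey, SimpleMetadata and ConsolidationSketch are hypothetical stand-ins written for this illustration, not Tika's Property or Metadata classes, and the key strings are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a composite property: one primary key plus secondaries.
final class CompositeKey {
    final String primary;
    final List<String> secondaries;

    CompositeKey(String primary, String... secondaries) {
        this.primary = primary;
        this.secondaries = Arrays.asList(secondaries);
    }
}

// Hypothetical stand-in for a metadata store keyed by plain strings.
final class SimpleMetadata {
    private final Map<String, String> values = new LinkedHashMap<>();

    // Writing through the composite key fills the primary and every secondary,
    // so code that still reads a deprecated key sees the same value.
    void set(CompositeKey key, String value) {
        values.put(key.primary, value);
        for (String secondary : key.secondaries) {
            values.put(secondary, value);
        }
    }

    // Returns the stored value, or null if the key was never written.
    String get(String name) {
        return values.get(name);
    }
}

public class ConsolidationSketch {
    // Mirrors "KEYWORDS <- DublinCore.SUBJECT, { Office.KEYWORDS, MSOffice.KEYWORDS,
    // Metadata.SUBJECT }" from the proposal; the key strings below are assumed for illustration.
    static final CompositeKey KEYWORDS =
            new CompositeKey("dc:subject", "meta:keyword", "Keywords", "subject");

    public static void main(String[] args) {
        SimpleMetadata metadata = new SimpleMetadata();
        metadata.set(KEYWORDS, "tika");
        System.out.println(metadata.get("dc:subject"));
        System.out.println(metadata.get("Keywords"));
    }
}
```

With this shape, a parser only needs to set the one consolidated key, and the choice of primary (here the DublinCore name) decides which specification the value is canonically filed under.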

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
