tika-dev mailing list archives

From "Giuseppe Totaro (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1691) Apache Tika for enabling metadata interoperability
Date Mon, 27 Jul 2015 17:21:04 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643066#comment-14643066 ]

Giuseppe Totaro commented on TIKA-1691:

Hi [~gagravarr], Hi [~chrismattmann],

did you have a chance to read my last comment?


> Apache Tika for enabling metadata interoperability
> --------------------------------------------------
>                 Key: TIKA-1691
>                 URL: https://issues.apache.org/jira/browse/TIKA-1691
>             Project: Tika
>          Issue Type: New Feature
>            Reporter: Giuseppe Totaro
>            Assignee: Giuseppe Totaro
>              Labels: mapping, metadata
>         Attachments: mapping_example.pdf
> If I am not mistaken, consistent metadata across file formats is already (partially)
provided in Tika through {{TikaCoreProperties}} and, within the context of Solr, {{ExtractingRequestHandler}}
(by defining how to map metadata fields in {{solrconfig.xml}}). However, I am working on a
new component for both schema mapping (which operates on the names of metadata properties) and
instance transformation (which operates on the values of metadata). It consists, essentially,
of the following changes:
> * A wrapper around the {{Metadata}} object ({{MappedMetadata.java}}) that decorates the {{set}}
method (currently at line 367 of {{Metadata.java}}) by applying the configured mapping functions
before setting metadata properties.
> * Basic mapping functions ({{BasicMappingUtils.java}}): utility methods that map
a set of metadata to the target schema.
> * A new {{MetadataConfig}} object that, like {{TikaConfig}}, can be configured
via an XML file (organized as shown in the following snippet) and allows fine-grained
metadata mapping by using Java reflection.
> {code:xml|title=tika-metadata.xml|borderStyle=solid}
> <?xml version="1.0" encoding="UTF-8" standalone="no"?>
> <properties>
>   <mappings>
>     <mapping type="type/sub-type">
>       <relation name="SOURCE_FIELD">
>         <target>TARGET_FIELD</target>
>         <expression>exclude|include|equivalent|overlap</expression>
>         <function name="FUNCTION_NAME">
>           <argument>ARGUMENT_VALUE</argument>
>         </function>
>         <cardinality>
>           <source>SOURCE_CARDINALITY</source>
>           <target>TARGET_CARDINALITY</target>
>           <order>ORDER_NUMBER</order>
>           <dependencies>
>             <field>FIELD_NAME</field>
>           </dependencies>
>         </cardinality>
>       </relation>
>     </mapping>
>     ...
>     <mapping> <!-- This contains the fallback strategy for unknown metadata -->
>       <relation>
>         ...
>       </relation>
>     </mapping>
>   </mappings>
> </properties>
> {code}
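The wrapper idea described above (decorating {{set}} so that a configured relation is applied before the property is stored) can be sketched roughly as follows. This is only an illustration under stated assumptions: {{SimpleMetadata}} is a hypothetical single-valued stand-in for Tika's {{Metadata}}, {{Relation}} is a hypothetical holder for the target field and the function of one {{<relation>}} element, and the relations are built by hand rather than read from {{tika-metadata.xml}}. It is not the code of the proposed patch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical stand-in for Tika's Metadata (single-valued, illustration only). */
class SimpleMetadata {
    private final Map<String, String> values = new HashMap<>();

    public void set(String name, String value) {
        values.put(name, value);
    }

    public String get(String name) {
        return values.get(name);
    }
}

/** Hypothetical rule: rename the source field and transform its value. */
final class Relation {
    final String target;
    final UnaryOperator<String> function;

    Relation(String target, UnaryOperator<String> function) {
        this.target = target;
        this.function = function;
    }
}

/** Decorates set(): applies the matching relation, if any, before storing. */
class MappedMetadata extends SimpleMetadata {
    private final Map<String, Relation> relations;

    MappedMetadata(Map<String, Relation> relations) {
        this.relations = new HashMap<>(relations);
    }

    @Override
    public void set(String name, String value) {
        Relation relation = relations.get(name);
        if (relation == null) {
            // No rule for this field: stored unchanged here; a fallback
            // strategy for unknown metadata could be plugged in instead.
            super.set(name, value);
            return;
        }
        String mapped = (value == null) ? null : relation.function.apply(value);
        super.set(relation.target, mapped);
    }
}

final class MappedMetadataDemo {
    public static void main(String[] args) {
        Map<String, Relation> relations = new HashMap<>();
        // Schema mapping (Author -> dc:creator) plus instance transformation (trim).
        relations.put("Author", new Relation("dc:creator", String::trim));

        MappedMetadata metadata = new MappedMetadata(relations);
        metadata.set("Author", "  Giuseppe Totaro ");
        metadata.set("Content-Type", "application/pdf");

        System.out.println(metadata.get("dc:creator"));
        System.out.println(metadata.get("Content-Type"));
    }
}
```

In this sketch the renaming of the field corresponds to schema mapping and the {{UnaryOperator}} to instance transformation; cardinality, ordering and dependencies from the XML snippet are left out.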
> The theoretical definition of metadata mapping is available in "[A survey of techniques
for achieving metadata interoperability|http://www.researchgate.net/profile/Bernhard_Haslhofer/publication/220566013_A_survey_of_techniques_for_achieving_metadata_interoperability/links/02e7e533e76187c0b8000000.pdf]".
This paper also shows some basic examples of metadata mappings.
> Currently, I am still working on some core functionality, but I have already run
some experiments with a small prototype.
> By the way, I think that we should modify the {{add}} method so that it calls {{set}}
instead of {{metadata.put}} (currently at line 316 of {{Metadata.java}}). This is a trivial
change (I could create a new Jira issue for it), but it would make the method consistent with
the other implementation of {{add}} and, moreover, it would make the methods of {{Metadata}}
easier to extend.
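To illustrate why routing {{add}} through {{set}} makes extension easier, here is a minimal sketch. The classes are hypothetical stand-ins and do not reproduce the actual signatures in {{Metadata.java}}: {{MultiValueMetadata}} keeps values in a map, its {{add}} writes through {{set}} rather than touching the map directly, and {{TrackingMetadata}} is an assumed subclass that only overrides {{set}}.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical multi-valued stand-in for Metadata (illustration only). */
class MultiValueMetadata {
    private final Map<String, String[]> metadata = new HashMap<>();

    public String[] getValues(String name) {
        String[] values = metadata.get(name);
        return values == null ? new String[0] : values;
    }

    /** Single write path: the only method that touches the map. */
    public void set(String name, String[] values) {
        metadata.put(name, values);
    }

    /** Appends a value and writes through set() instead of metadata.put(). */
    public void add(String name, String value) {
        String[] old = getValues(name);
        String[] updated = Arrays.copyOf(old, old.length + 1);
        updated[old.length] = value;
        set(name, updated);
    }
}

/** Assumed subclass: overriding set() alone is enough to observe add() too. */
class TrackingMetadata extends MultiValueMetadata {
    final List<String> writes = new ArrayList<>();

    @Override
    public void set(String name, String[] values) {
        writes.add(name); // record every write, whichever method triggered it
        super.set(name, values);
    }
}

final class AddViaSetDemo {
    public static void main(String[] args) {
        TrackingMetadata metadata = new TrackingMetadata();
        metadata.add("dc:subject", "mapping");
        metadata.add("dc:subject", "metadata");

        System.out.println(Arrays.toString(metadata.getValues("dc:subject")));
        System.out.println(metadata.writes);
    }
}
```

If {{add}} called {{metadata.put}} directly, the override in the subclass would never see these writes, and a wrapper that decorates {{set}} would have to override {{add}} as well.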
> I would really appreciate your feedback on this proposal. If you believe that it is
a good idea, I could provide the code in a few days.
> Thanks a lot,
> Giuseppe

This message was sent by Atlassian JIRA
