tika-dev mailing list archives

From "Ray Gauss II (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (TIKA-1133) Ability to Allow Empty and Duplicate Tika Values for XML Elements
Date Tue, 11 Jun 2013 03:17:20 GMT

     [ https://issues.apache.org/jira/browse/TIKA-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II resolved TIKA-1133.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 1.4

Resolved in r1491680.
                
> Ability to Allow Empty and Duplicate Tika Values for XML Elements
> -----------------------------------------------------------------
>
>                 Key: TIKA-1133
>                 URL: https://issues.apache.org/jira/browse/TIKA-1133
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.3
>            Reporter: Ray Gauss II
>            Assignee: Ray Gauss II
>             Fix For: 1.4
>
>
> In some cases it is beneficial to allow empty and duplicate Tika metadata values for multi-valued XML elements like RDF bags.
> Consider an example where the original source metadata is structured something like:
> {code}
> <Person>
>   <FirstName>John</FirstName>
>   <LastName>Smith</LastName>
> </Person>
> <Person>
>   <FirstName>Jane</FirstName>
>   <LastName>Doe</LastName>
> </Person>
> <Person>
>   <FirstName>Bob</FirstName>
> </Person>
> <Person>
>   <FirstName>Kate</FirstName>
>   <LastName>Smith</LastName>
> </Person>
> {code}
> Since Tika stores only flat metadata, we transform that before invoking a parser into something like:
> {code}
>  <custom:FirstName>
>   <rdf:Bag>
>    <rdf:li>John</rdf:li>
>    <rdf:li>Jane</rdf:li>
>    <rdf:li>Bob</rdf:li>
>    <rdf:li>Kate</rdf:li>
>   </rdf:Bag>
>  </custom:FirstName>
>  <custom:LastName>
>   <rdf:Bag>
>    <rdf:li>Smith</rdf:li>
>    <rdf:li>Doe</rdf:li>
>    <rdf:li></rdf:li>
>    <rdf:li>Smith</rdf:li>
>   </rdf:Bag>
>  </custom:LastName>
> {code}
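
To make the flattening step concrete, here is a minimal sketch in plain Java of how nested records could be turned into parallel value lists. The `FlattenPeople` and `Person` names are hypothetical and not part of Tika; the sketch assumes a missing last name is written as an empty string so that every list keeps one entry per person.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FlattenPeople {
    /** Hypothetical holder for one <Person> element; lastName is null when the element is absent. */
    static final class Person {
        final String firstName;
        final String lastName;

        Person(String firstName, String lastName) {
            this.firstName = firstName;
            this.lastName = lastName;
        }
    }

    /** Collects first names in document order, one entry per person. */
    static List<String> firstNames(List<Person> people) {
        List<String> values = new ArrayList<>();
        for (Person p : people) {
            values.add(p.firstName == null ? "" : p.firstName);
        }
        return values;
    }

    /** Collects last names in document order; a missing name becomes "" to hold its position. */
    static List<String> lastNames(List<Person> people) {
        List<String> values = new ArrayList<>();
        for (Person p : people) {
            values.add(p.lastName == null ? "" : p.lastName);
        }
        return values;
    }

    public static void main(String[] args) {
        List<Person> people = Arrays.asList(
                new Person("John", "Smith"),
                new Person("Jane", "Doe"),
                new Person("Bob", null),
                new Person("Kate", "Smith"));
        // Both lists have four entries, so index i in each refers to the same person.
        System.out.println(firstNames(people));
        System.out.println(lastNames(people));
    }
}
```

Each list corresponds to the `rdf:li` entries of one bag, and the data can only be mapped back to people if both lists keep the same length and order.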
> The current behavior ignores empties and duplicates, so we can't tell whether Bob or Kate ever had last names. Empties or duplicates in other positions result in an incorrect mapping of data.
> We should add an option to create an {{ElementMetadataHandler}} that allows empty and/or duplicate values.
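
The effect of such an option can be illustrated with a small stand-alone sketch. This is not Tika's implementation: `BagValueCollector`, `collect`, and the two flag names are assumptions made for illustration, and trimming values before comparison is also an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BagValueCollector {
    /**
     * Collects the values of one multi-valued element in document order.
     * With both flags false, empty and repeated values are dropped.
     */
    static List<String> collect(List<String> items, boolean allowDuplicates, boolean allowEmpty) {
        List<String> kept = new ArrayList<>();
        for (String item : items) {
            String value = item == null ? "" : item.trim();
            if (value.isEmpty() && !allowEmpty) {
                continue; // skip empty entries unless explicitly allowed
            }
            if (!allowDuplicates && kept.contains(value)) {
                continue; // skip repeated entries unless explicitly allowed
            }
            kept.add(value);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lastNames = Arrays.asList("Smith", "Doe", "", "Smith");

        // Dropping empties and duplicates leaves two values for four people.
        System.out.println(collect(lastNames, false, false)); // prints [Smith, Doe]

        // Keeping both preserves one entry per person, so positions still line up.
        System.out.println(collect(lastNames, true, true).size()); // prints 4
    }
}
```

With both flags off, the last-name list has two entries against four first names, so nothing indicates which person each remaining name belongs to. With both flags on, index 2 is the empty value for Bob and index 3 is Smith for Kate.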

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
