tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
Date Wed, 29 Jul 2015 14:38:05 GMT

     [ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1607:
    Attachment: TIKA-1607v1_rough_rough.patch

I'm attaching a strawman approach to this; it is slightly different from what was initially
proposed.  This would also support TIKA-1295.

Some items:
 1. It feels backwards to have the MetadataValue determine whether its Property is appropriate,
 but this seemed like a much smaller and more extensible change: we won't have to change
 Property at all.
 2. Should we try to genericize MetadataValue?
 3. Lots more needs to be done...this is just an initial proposal.
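
To make item 1 concrete, here is a minimal sketch of a value type that decides for itself whether it may be stored under a given property. It is not taken from the attached patch: `PropertyKey` is a hypothetical stand-in for Tika's `Property`, and `MetadataValue#isAllowed`, `MetadataValue#getString`, and `PhoneNumberValue` are assumed names used only for illustration.

```java
import java.util.Objects;

// Hypothetical stand-in for Tika's Property; only the name matters in this sketch.
final class PropertyKey {
    private final String name;

    PropertyKey(String name) {
        this.name = Objects.requireNonNull(name);
    }

    String getName() {
        return name;
    }
}

// Hypothetical value type: the value, not the property, decides whether the pairing is valid.
interface MetadataValue {
    boolean isAllowed(PropertyKey property);

    // String form, so that existing String-based callers could keep working.
    String getString();
}

// Hypothetical structured value for the phone number use case.
final class PhoneNumberValue implements MetadataValue {
    private final String number;
    private final String countryCode;

    PhoneNumberValue(String number, String countryCode) {
        this.number = Objects.requireNonNull(number);
        this.countryCode = Objects.requireNonNull(countryCode);
    }

    @Override
    public boolean isAllowed(PropertyKey property) {
        // Assumption: phone numbers are only valid under the "phonenumbers" key.
        return "phonenumbers".equals(property.getName());
    }

    @Override
    public String getString() {
        return number;
    }

    String getCountryCode() {
        return countryCode;
    }
}
```

Under these assumptions, adding a new kind of value means adding a new `MetadataValue` implementation; the property type itself stays untouched, which is the trade-off item 1 describes.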

There are some other rough edges...any and all feedback welcome!

> Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
> -----------------------------------------------------------------------------------------
>                 Key: TIKA-1607
>                 URL: https://issues.apache.org/jira/browse/TIKA-1607
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, metadata
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Critical
>             Fix For: 1.10
>         Attachments: TIKA-1607v1_rough_rough.patch
> I am currently working on implementing more comprehensive extraction and enhancing
> Tika's support for phone number extraction and metadata modeling.
> Right now we use the String[] multivalued support available within Tika to persist
> phone numbers as
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose that we extend multi-valued support beyond the String[] paradigm
> by implementing a more abstract Collection of Objects, so that we could model the phone
> number use case as follows:
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection<HashMap<String/Property, HashMap<String/Property,
> String/Int/Long>>>, e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode: US), (LibPN-NumberType:
> International), (etc: etc)...), (+1292611054: (LibPN-CountryCode: UK), (LibPN-NumberType:
> International), (etc: etc)...), (etc)]
> {code}
> There are obvious backwards compatibility issues with this approach. Additionally, it
> is a fundamental change to the core Metadata API. I hope, however, that the <String, Object>
> mapping is flexible enough to allow me to model Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis
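
For illustration, the structure proposed in the quoted description could be sketched in plain Java collections as below. This is only a sketch under stated assumptions: `ObjectMetadataSketch`, `set`, `get`, `phoneEntry`, and `example` are hypothetical names and are not part of the Tika `Metadata` API; the phone numbers and `LibPN-*` keys are the ones given in the description.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical <String, Object> metadata holder; not the Tika Metadata class.
class ObjectMetadataSketch {
    private final Map<String, Object> values = new LinkedHashMap<>();

    void set(String key, Object value) {
        values.put(key, value);
    }

    Object get(String key) {
        return values.get(key);
    }

    // Builds one entry of the form (number: (LibPN-CountryCode: ..), (LibPN-NumberType: ..)).
    static Map<String, Map<String, String>> phoneEntry(
            String number, String countryCode, String numberType) {
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put("LibPN-CountryCode", countryCode);
        attributes.put("LibPN-NumberType", numberType);

        Map<String, Map<String, String>> entry = new LinkedHashMap<>();
        entry.put(number, attributes);
        return entry;
    }

    // Recreates the example from the issue description.
    static ObjectMetadataSketch example() {
        List<Map<String, Map<String, String>>> phones = new ArrayList<>();
        phones.add(phoneEntry("+162648743476", "US", "International"));
        phones.add(phoneEntry("+1292611054", "UK", "International"));

        ObjectMetadataSketch metadata = new ObjectMetadataSketch();
        metadata.set("phonenumbers", phones);
        return metadata;
    }
}
```

Because `get` returns `Object`, every caller has to check and cast the value before using it. That is one concrete form of the backwards compatibility cost mentioned in the description, compared with today's `String[]` values.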

