tika-dev mailing list archives

From "Nick Burch (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-675) PackageExtractor should track names of recursively nested resources
Date Tue, 24 Jan 2012 15:04:38 GMT

    [ https://issues.apache.org/jira/browse/TIKA-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192200#comment-13192200
] 

Nick Burch commented on TIKA-675:
---------------------------------

We could probably do this with a wrapper parser that tracks the name, writes the nested
name to the metadata, then delegates to a different parser for the actual processing.

If we added this, we'd need to decide which metadata key to put it in (a new one, or change
the resource name?) and how to separate the parts (maybe a ! like in VFS?).

It should be very quick to do, though, once those are decided.
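
A rough sketch of the name-tracking part, to make the idea concrete. It deliberately avoids the Tika API: NestedNameTracker, parseNested and the "!/" separator are all hypothetical names picked for illustration, and the Consumer stands in for whatever parser would actually be delegated to. A real wrapper would write the joined name into the metadata instead of passing it along.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;

/** Hypothetical helper (not part of Tika): tracks the chain of container names. */
class NestedNameTracker {
    // Assumed separator, modelled on the VFS-style "!" suggested above.
    static final String SEPARATOR = "!/";

    // Names of the resources currently being parsed, outermost first.
    private final Deque<String> names = new ArrayDeque<>();

    void enter(String resourceName) { names.addLast(resourceName); }

    void leave() { names.removeLast(); }

    /** Joins the tracked names, outermost first, with the separator. */
    String currentPath() { return String.join(SEPARATOR, names); }

    /** Records the name, hands the nested name to the delegate, then pops the name again. */
    void parseNested(String resourceName, Consumer<String> delegate) {
        enter(resourceName);
        try {
            delegate.accept(currentPath());
        } finally {
            leave();
        }
    }

    public static void main(String[] args) {
        NestedNameTracker tracker = new NestedNameTracker();
        // Simulates a .tar.gz that contains a .tar that contains an HTML file.
        tracker.parseNested("example.tar.gz", gz ->
            tracker.parseNested("example.tar", tar ->
                tracker.parseNested("example.html", System.out::println)));
        // prints example.tar.gz!/example.tar!/example.html
    }
}
```

The try/finally keeps the stack consistent even if the delegate throws, which matters when one broken entry should not corrupt the names reported for its siblings.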
                
> PackageExtractor should track names of recursively nested resources
> -------------------------------------------------------------------
>
>                 Key: TIKA-675
>                 URL: https://issues.apache.org/jira/browse/TIKA-675
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.10
>            Reporter: Andrzej Bialecki 
>
> When parsing archive formats the hierarchy of names is not tracked, only the current
> embedded component's name is preserved under Metadata.RESOURCE_NAME_KEY. In a way similar
> to the VFS model it would be nice to build pseudo-urls for nested resources. In case of Tika
> API that uses streams this could look like {code}tar:gz:stream://example.tar.gz!/example.tar!/example.html{code}
> ...or otherwise track the parent-child relationship - e.g. some applications need this information
> to indicate what composite documents to delete from the index after a container archive has
> been deleted.
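
For the index-cleanup use case in the description, one possible approach is a prefix match on the nested name. The sketch below assumes the nested names were stored with a "!/" separator; NestedNameIndex, isInside and childrenOf are hypothetical names for illustration, not Tika or index API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of Tika): finds entries nested inside a container. */
class NestedNameIndex {
    // Assumed separator between container and entry names.
    static final String SEPARATOR = "!/";

    /** True if nestedName lies inside the given container, at any depth. */
    static boolean isInside(String nestedName, String container) {
        // Appending the separator avoids matching "example.tar.gzip" for "example.tar.gz".
        return nestedName.startsWith(container + SEPARATOR);
    }

    /** Returns the indexed names that should go away when the container is deleted. */
    static List<String> childrenOf(String container, List<String> indexedNames) {
        List<String> result = new ArrayList<>();
        for (String name : indexedNames) {
            if (isInside(name, container)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> indexed = Arrays.asList(
            "example.tar.gz!/example.tar",
            "example.tar.gz!/example.tar!/example.html",
            "other.zip!/readme.txt");
        // Prints the two entries nested under example.tar.gz.
        System.out.println(childrenOf("example.tar.gz", indexed));
    }
}
```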

