tika-dev mailing list archives

From "Tyler Palsulich (JIRA)" <j...@apache.org>
Subject [jira] [Closed] (TIKA-675) PackageExtractor should track names of recursively nested resources
Date Mon, 02 Mar 2015 17:58:05 GMT

     [ https://issues.apache.org/jira/browse/TIKA-675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tyler Palsulich closed TIKA-675.
    Resolution: Fixed

Marking as fixed. Please see the RecursiveParserWrapper. Thanks, Nick.

> PackageExtractor should track names of recursively nested resources
> -------------------------------------------------------------------
>                 Key: TIKA-675
>                 URL: https://issues.apache.org/jira/browse/TIKA-675
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.10
>            Reporter: Andrzej Bialecki 
> When parsing archive formats, the hierarchy of names is not tracked; only the current
> embedded component's name is preserved under Metadata.RESOURCE_NAME_KEY. In a way similar
> to the VFS model, it would be nice to build pseudo-URLs for nested resources. In the case of
> the Tika API, which uses streams, this could look like
> {code}tar:gz:stream://example.tar.gz!/example.tar!/example.html{code}
> ...or otherwise track the parent-child relationship. For example, some applications need this
> information to indicate which composite documents to delete from the index after a container
> archive has been deleted.
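
For illustration only, the pseudo-URL scheme proposed in the description could be sketched in plain Java as below. NestedResourcePath, enter(), toUrl() and isInside() are hypothetical names invented for this sketch; they are not part of Tika and this is not how RecursiveParserWrapper works. The sketch assumes the format implied by the example in the issue: container schemes listed innermost first, a "stream://" root, and "!/" between nesting levels.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper (not a Tika class): tracks the chain of nested
// resource names and renders it as a VFS-style pseudo-URL.
class NestedResourcePath {
    // Container schemes, innermost first, e.g. [tar, gz].
    private final Deque<String> schemes = new ArrayDeque<>();
    // Resource names joined by "!/", outermost first.
    private final StringBuilder path = new StringBuilder();

    NestedResourcePath(String rootName) {
        path.append(rootName);
    }

    // Descend into a container of the given type and record the entry name.
    NestedResourcePath enter(String containerScheme, String entryName) {
        schemes.addFirst(containerScheme);
        path.append("!/").append(entryName);
        return this;
    }

    String toUrl() {
        StringBuilder url = new StringBuilder();
        for (String scheme : schemes) {
            url.append(scheme).append(':');
        }
        return url.append("stream://").append(path).toString();
    }

    // True if the pseudo-URL refers to something nested inside the named
    // root container (assumed use: index cleanup after the container is deleted).
    static boolean isInside(String url, String containerName) {
        String p = url.substring(url.indexOf("://") + 3);
        return p.startsWith(containerName + "!/");
    }

    public static void main(String[] args) {
        String url = new NestedResourcePath("example.tar.gz")
                .enter("gz", "example.tar")
                .enter("tar", "example.html")
                .toUrl();
        System.out.println(url);
        System.out.println(isInside(url, "example.tar.gz"));
    }
}
```

With identifiers of this shape, an application could find every indexed document that belongs to a deleted container by matching on the container name, which is the use case the reporter mentions.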

This message was sent by Atlassian JIRA
