tika-dev mailing list archives

From "Nick Burch (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (TIKA-1212) Recursive Extraction of Archive File
Date Mon, 02 Jun 2014 17:06:01 GMT

     [ https://issues.apache.org/jira/browse/TIKA-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-1212.
------------------------------

    Resolution: Invalid

The problem is that you're not tracking how far down the rabbit hole you've gone when you
recurse. When your recursing parser is processing a resource and wants to recurse again,
it needs to track where it is and tell the next one down where it came from.

I've added a simple example of this to the wiki - https://wiki.apache.org/tika/RecursiveMetadata#Tracking_how_far_down_the_Rabbit_Hole_you_have_gone

Various other approaches will work too. The trick is that each time you recurse, you need
to track where you came from if you want relative paths.
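
As a rough illustration of the idea (not the wiki example, and not Tika's own API), here is
a minimal Java sketch that uses only java.util.zip. The class PathTrackingWalker and its
methods listPaths and sampleArchive are hypothetical names made up for this illustration,
and treating every entry whose name ends in .zip as a nested archive is a simplifying
assumption. Each recursive call receives its parent's path and hands the extended path to
the next level down. It needs Java 9 or later for readAllBytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for illustration only; not part of Tika.
class PathTrackingWalker {

    /**
     * Lists every non-archive entry under the given zip stream.
     * parentPath is "where we came from"; it is extended and passed
     * down on each recursion so nested entries keep their full path.
     */
    static List<String> listPaths(String parentPath, InputStream in) throws IOException {
        List<String> paths = new ArrayList<>();
        ZipInputStream zip = new ZipInputStream(in);
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            if (entry.isDirectory()) {
                continue;
            }
            String fullPath = parentPath + "/" + entry.getName();
            if (entry.getName().endsWith(".zip")) {
                // Assumption: a ".zip" suffix marks a nested archive.
                // Buffer the current entry, then recurse with the extended path.
                byte[] nested = zip.readAllBytes();
                paths.addAll(listPaths(fullPath, new ByteArrayInputStream(nested)));
            } else {
                paths.add(fullPath);
            }
        }
        return paths;
    }

    /** Builds an in-memory zip from entry names and contents. */
    static byte[] zip(Map<String, byte[]> entries) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            for (Map.Entry<String, byte[]> e : entries.entrySet()) {
                out.putNextEntry(new ZipEntry(e.getKey()));
                out.write(e.getValue());
                out.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    /** Recreates the layout from the issue: abc.zip holding a.doc, b.xls and pqr.zip (with m.ppt). */
    static byte[] sampleArchive() throws IOException {
        Map<String, byte[]> inner = new LinkedHashMap<>();
        inner.put("m.ppt", new byte[0]);
        Map<String, byte[]> outer = new LinkedHashMap<>();
        outer.put("a.doc", new byte[0]);
        outer.put("b.xls", new byte[0]);
        outer.put("pqr.zip", zip(inner));
        return zip(outer);
    }

    public static void main(String[] args) throws IOException {
        // Start the recursion with the name of the top-level archive.
        for (String path : listPaths("abc.zip", new ByteArrayInputStream(sampleArchive()))) {
            System.out.println(path);
        }
    }
}
```

With the layout from the issue, this prints abc.zip/a.doc, abc.zip/b.xls and
abc.zip/pqr.zip/m.ppt. In a Tika-based recursing parser, the same parent path would have to
be carried along by whatever wrapper does the recursion.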

> Recursive Extraction of Archive File
> ------------------------------------
>
>                 Key: TIKA-1212
>                 URL: https://issues.apache.org/jira/browse/TIKA-1212
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Vikram
>            Priority: Critical
>         Attachments: RecursiveMetadataParserZukka.java, TIKA-Output.xlsx, abc.zip, abc.zip
>
>
> Please refer to the code: http://wiki.apache.org/tika/RecursiveMetadata#Main_from_Jukka.27s_Example
> Requirement:
> -----------------
> abc.zip
>    ---> a.doc
>    ---> b.xls
>    ---> pqr.zip
>   -------------> m.ppt
> There are two issues with TIKA:
> 1. How can extraction of embedded documents be optionally blocked?
> 2. When I extract recursively, the file name or resourceKeyName is not reported correctly.
> For example:
>     --> a.doc should have the value abc.zip/a.doc, and similarly for b.xls. This works.
>         But m.ppt gets the resource file name pqr/m.ppt, which is wrong. It should have
>         the value abc.zip/pqr.zip/m.ppt.
>     --> For embedded documents, only a random name is reported, without a proper
>         file path.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
