tika-dev mailing list archives

From "Nick Burch (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-1212) Recursive Extraction of Archive File
Date Wed, 04 Jun 2014 14:55:02 GMT

     [ https://issues.apache.org/jira/browse/TIKA-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch updated TIKA-1212:

    Attachment: RecursiveParsingExample.java

This is why we should have our examples pulled from svn, where we can check that they compile
and run... (See dev@ posts about this)

I've fixed the logic on the wiki, and attached an example program which lets you pick between
the two different wiki-based examples. We may want to put it into svn under tika-examples/src/main/java/org/apache/tika/examples/RecursiveParsingExample.java
at a later date. It does seem to work correctly now.

> Recursive Extraction of Archive File
> ------------------------------------
>                 Key: TIKA-1212
>                 URL: https://issues.apache.org/jira/browse/TIKA-1212
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Vikram
>            Priority: Critical
>         Attachments: RECURSIVE_PARSER_WRAPPER_HACK.patch, RecursiveMetadataParserZukka.java, RecursiveParsingExample.java, TIKA-Output.xlsx, abc.zip, abc.zip, test_recursive_embedded.docx
> Please refer to the code: http://wiki.apache.org/tika/RecursiveMetadata#Main_from_Jukka.27s_Example
> Requirement:
> -----------------
> abc.zip
>    ---> a.doc
>    ---> b.xls
>    ---> pqr.zip
>   -------------> m.ppt
> There are two issues with TIKA:
> 1. How can separate extraction of embedded documents be optionally blocked?
> 2. When I extract recursively, the file name or resourceKeyName is not reported properly. For example:
>     --> a.doc should have the value abc.zip/a.doc. Similarly for b.xls. This is fine,
>         BUT m.ppt has the resource file name pqr/m.ppt, which is WRONG. This should have value
>     --> Even for an embedded doc, only a random name is reported, without the proper file path.
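
The naming requirement in point 2 can be illustrated without Tika. The sketch below is not Tika's API and is not the attached RecursiveParsingExample.java; it uses only java.util.zip to show the bookkeeping involved: the path of the enclosing container is passed down at each level of recursion and prefixed to every entry name. It assumes that a nested entry should be named by joining the full container path with "/", by analogy with abc.zip/a.doc. The class and method names (NestedZipPaths, listPaths, zipOf) are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of Tika: lists entries of nested zips with full paths.
final class NestedZipPaths {

    /** Returns the full path of every non-zip file inside the archive, recursing into nested zips. */
    public static List<String> listPaths(byte[] zipBytes, String rootName) {
        List<String> out = new ArrayList<>();
        try {
            walk(new ByteArrayInputStream(zipBytes), rootName, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    private static void walk(InputStream in, String containerPath, List<String> out)
            throws IOException {
        // Not closed here: closing a nested stream would also close the enclosing one.
        ZipInputStream zip = new ZipInputStream(in);
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            if (entry.isDirectory()) {
                continue;
            }
            // Carry the whole container path down, not just the nearest parent's name.
            String fullPath = containerPath + "/" + entry.getName();
            if (entry.getName().toLowerCase().endsWith(".zip")) {
                walk(zip, fullPath, out); // the current entry's data is itself a zip stream
            } else {
                out.add(fullPath);
            }
        }
    }

    /** Builds a small in-memory zip so the example needs no files on disk. */
    public static byte[] zipOf(String[] names, byte[][] contents) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (int i = 0; i < names.length; i++) {
                zip.putNextEntry(new ZipEntry(names[i]));
                zip.write(contents[i]);
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] dummy = "x".getBytes();
        // Same layout as in the report: abc.zip holds a.doc, b.xls and pqr.zip, which holds m.ppt.
        byte[] pqr = zipOf(new String[] {"m.ppt"}, new byte[][] {dummy});
        byte[] abc = zipOf(new String[] {"a.doc", "b.xls", "pqr.zip"},
                           new byte[][] {dummy, dummy, pqr});
        listPaths(abc, "abc.zip").forEach(System.out::println);
    }
}
```

Under the naming assumption above, the nested entry is reported with the outer archive's name included rather than only its immediate parent. A Tika-based version would need to do the equivalent bookkeeping wherever embedded documents are handed to the parser.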

This message was sent by Atlassian JIRA
