tika-dev mailing list archives

From "Keith R. Bennett (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-35) Extract MsOffice properties
Date Thu, 27 Sep 2007 19:52:53 GMT

    [ https://issues.apache.org/jira/browse/TIKA-35?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530822 ]

Keith R. Bennett commented on TIKA-35:

Rida -

The big question is: do we support the ability of parser implementations to make multiple
passes over a stream?  If so, then we need to incorporate this cleanly into the architectural
design.  Possible solutions are:

1) Save the contents of the stream during the first pass.  Or, if the stream supports it
(see markSupported()), use mark() and reset().
2) Pass to the Parsers a URL instead of an InputStream so that we can create a stream multiple
times.  This is simpler, but runs the risk of the resource changing between stream instantiations.
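
For illustration, the two variants of option 1 might look roughly like the following.
This is only a sketch under my own assumptions: the class and method names
(MultiPassStreams, readFully, readTwice) are made up for this example and are not part of
Tika, and a real implementation would have to bound memory use for large documents.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not Tika API: two ways to make more than one pass over a stream.
public class MultiPassStreams {

    // Variant A: save the whole stream in memory during the first pass.
    // Each later pass can then open a fresh ByteArrayInputStream over the bytes.
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    // Variant B: use mark() and reset(). Streams that do not support marking are
    // wrapped in a BufferedInputStream, which does. reset() is only guaranteed to
    // work if no more than readLimit bytes were read after mark().
    static String[] readTwice(InputStream raw, int readLimit) throws IOException {
        InputStream in = raw.markSupported() ? raw : new BufferedInputStream(raw);
        in.mark(readLimit);
        String first = new String(readFully(in), "UTF-8");   // first pass
        in.reset();                                           // rewind to the mark
        String second = new String(readFully(in), "UTF-8");  // second pass
        return new String[] { first, second };
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "properties and full text".getBytes("UTF-8");

        // Variant A: one saved copy, two independent streams.
        byte[] saved = readFully(new ByteArrayInputStream(data));
        InputStream propertiesPass = new ByteArrayInputStream(saved);
        InputStream fullTextPass = new ByteArrayInputStream(saved);
        System.out.println(propertiesPass.available() == fullTextPass.available());

        // Variant B: one stream, rewound between passes.
        String[] passes = readTwice(new ByteArrayInputStream(data), data.length);
        System.out.println(passes[0].equals(passes[1]));
    }
}
```

Variant A keeps the parsers free of any resource identifier, at the cost of holding the
document in memory; variant B avoids the extra copy only when the read limit is known to
be large enough.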

IMO it would not be a good idea to put a resource identifier in the Parser class, even temporarily
-- this runs counter to our goal of making the parsers stateless.

Instead, we could start discussing (or should I say continue to discuss?) how to support multiple
passes cleanly in the architecture.


P.S. For anyone having trouble applying Rida's patch, passing the "-p5" option to patch worked
for me.

> Extract MsOffice properties
> ---------------------------
>                 Key: TIKA-35
>                 URL: https://issues.apache.org/jira/browse/TIKA-35
>             Project: Tika
>          Issue Type: Improvement
>    Affects Versions: 0.1-incubator
>            Reporter: Rida Benjelloun
>             Fix For: 0.1-incubator
>         Attachments: tika35.patch
> Hi,
> I have developed a patch that allows MsOffice properties extraction. I wasn't able to
> extract the MsOffice properties and full text from a single inputstream, I always get this
> error : java.io.IOException: Unable to read entire header; -1 bytes read;
> expected 512 bytes.
> I don't know how they make it work in Nutch (any ideas ?).
> To get it to work, I have added a "filePath" variable in the parser class, and I populate
> it from the ParseUtils class. After that I create an inputStream from filePath or Url and I use
> it to extract properties and I use the default inputstream to extract full text.
> I didn't commit this modification; I would like to have your opinions before.
> Regards.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
