tika-dev mailing list archives

From "Daan de Wit (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (TIKA-203) Earlier metadata extraction in ParsingReader
Date Fri, 17 Jul 2009 13:07:14 GMT

    [ https://issues.apache.org/jira/browse/TIKA-203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732512#action_12732512 ]

Daan de Wit edited comment on TIKA-203 at 7/17/09 6:05 AM:
-----------------------------------------------------------

Does not work for me on Ubuntu 8.04 with Sun Java 1.5.0_16 on a single processor when parsing
certain Word documents.

      was (Author: d.de.wit):
    does not work for me on Ubuntu 8.04 with Sun java 1.5.0_16 on 1 processor
  
> Earlier metadata extraction in ParsingReader
> --------------------------------------------
>
>                 Key: TIKA-203
>                 URL: https://issues.apache.org/jira/browse/TIKA-203
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 0.3
>
>
> The normal parse() method guarantees that all extracted metadata will be available in
> the metadata object once the method returns. But since the ParsingReader class runs the parse()
> method in a background thread, one can only assume that extracted metadata is available once
> the entire character stream has been consumed. This is troublesome for example when creating
> Lucene Document objects, as Lucene postpones reading the given character stream to when the
> already constructed Document is passed to an IndexWriter. The result is that (depending on
> thread scheduling and the structure of the input document format) metadata may not be available
> for inclusion in the indexed Document.
>
> One way of fixing this issue is to add a small character buffer in ParsingReader, and
> to make sure that the buffer is filled with extracted text before the ParsingReader constructor
> returns. This should ensure that relevant document metadata is almost always available, since
> the majority of document formats have all or most metadata at the beginning of the document
> stream.
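
The buffering idea described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not Tika's actual ParsingReader code: the class name PrefilledParsingReader, the getMetadata() accessor, and the fake "parser" thread (which records a title and then writes the body text) are all assumptions made for the example. The only technique shown is having the constructor block until the background thread has produced its first character, using mark/read/reset on a BufferedReader so that no text is consumed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.Reader;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for ParsingReader; names and structure are assumptions. */
class PrefilledParsingReader extends Reader {

    private final BufferedReader reader;
    private final Map<String, String> metadata = new ConcurrentHashMap<>();

    PrefilledParsingReader(String title, String body) throws IOException {
        PipedReader pipe = new PipedReader();
        PipedWriter writer = new PipedWriter(pipe);
        // The BufferedReader plays the role of the "small character buffer".
        reader = new BufferedReader(pipe);

        // Fake background parse: metadata is found first, text follows,
        // mimicking formats that keep metadata at the start of the stream.
        Thread parser = new Thread(() -> {
            try (PipedWriter out = writer) {
                metadata.put("title", title);
                out.write(body);
            } catch (IOException e) {
                // The consumer closed the reader early; nothing more to do.
            }
        });
        parser.setDaemon(true);
        parser.start();

        // Block until the first character (or end of stream) arrives, then
        // rewind so the caller still sees the complete text.
        reader.mark(1);
        reader.read();
        reader.reset();
    }

    /** Returns the metadata value recorded so far, or null if absent. */
    String getMetadata(String key) {
        return metadata.get(key);
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        return reader.read(cbuf, off, len);
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }
}
```

With this arrangement, a caller that constructs the reader and immediately asks for getMetadata("title") sees the value before reading any text. Metadata that a parser only discovers later in the stream would still be missing at that point, which matches the "almost always" qualification above.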

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

