tika-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1353) OpenDocumentParser doesn't correctly process metadata
Date Tue, 24 Jun 2014 16:55:27 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14042352#comment-14042352 ]

Hudson commented on TIKA-1353:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.7 #64 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/64/])
TIKA-1353 If a File is available, parse ODF documents with it, so that the metadata can always be processed first (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1605124)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentParser.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java
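
The idea in the commit message (use the file, when one is available, so that metadata is handled first) can be illustrated with a minimal sketch. This is not the code from the Tika commit: the class and method names below are hypothetical, and only the standard java.util.zip API is used. It assumes the document is on disk, so entries can be looked up by name instead of being read in archive order.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration class; not part of Tika.
class OdfEntryOrderSketch {

    /** Writes a tiny zip in which content.xml is stored BEFORE meta.xml. */
    static void writeSample(File target) throws IOException {
        try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(target))) {
            for (String name : new String[] {"content.xml", "meta.xml"}) {
                out.putNextEntry(new ZipEntry(name));
                out.write(("<" + name + "/>").getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        }
    }

    /** Reads meta.xml first, then content.xml, regardless of archive order. */
    static List<String> readMetaThenContent(File odf) throws IOException {
        List<String> processed = new ArrayList<>();
        try (ZipFile zip = new ZipFile(odf)) {
            for (String name : new String[] {"meta.xml", "content.xml"}) {
                ZipEntry entry = zip.getEntry(name); // lookup by name, not by position
                if (entry == null) {
                    continue; // entry absent; skip it
                }
                try (InputStream in = zip.getInputStream(entry)) {
                    // A real parser would hand this stream to its metadata or
                    // content handler; here the bytes are only drained.
                    byte[] buffer = new byte[4096];
                    while (in.read(buffer) != -1) {
                        // discard
                    }
                }
                processed.add(name);
            }
        }
        return processed;
    }

    public static void main(String[] args) throws IOException {
        File sample = File.createTempFile("odf-sketch", ".zip");
        sample.deleteOnExit();
        writeSample(sample);
        System.out.println(readMetaThenContent(sample));
    }
}
```

Running main prints [meta.xml, content.xml] even though the sample archive stores content.xml first. A stream-only reader (ZipInputStream) sees entries strictly in archive order, which is why having a file makes the metadata-first ordering straightforward.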


> OpenDocumentParser doesn't correctly process metadata
> -----------------------------------------------------
>
>                 Key: TIKA-1353
>                 URL: https://issues.apache.org/jira/browse/TIKA-1353
>             Project: Tika
>          Issue Type: Bug
>          Components: metadata, parser
>    Affects Versions: 1.5
>            Reporter: Steve R
>             Fix For: 1.6
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When using OpenDocumentParser, the metadata isn't set correctly. When using it to write an HTML file, the only metadata it knows about is the content type, because that is set ahead of time.
> The problem is that when iterating over the zip contents, meta.xml isn't processed before content.xml. The metadata set on the parse object is correct after parse() returns; however, the resulting HTML file is missing all of the metadata.
> Changing the code to the following:
>
>     boolean parsedMetaData = false;
>     boolean delayLoadContent = false;
>     while (entry != null) {
>         ...
>         } else if (entry.getName().equals("meta.xml")) {
>             meta.parse(zip, new DefaultHandler(), metadata, context);
>             parsedMetaData = true;
>             // content.xml was seen earlier and deferred: parse it now
>             if (delayLoadContent) {
>                 if (content instanceof OpenDocumentContentParser) {
>                     ((OpenDocumentContentParser) content).parseInternal(zip, handler, metadata, context);
>                 } else {
>                     // Foreign content parser was set:
>                     content.parse(zip, handler, metadata, context);
>                 }
>             }
>         } else if (entry.getName().endsWith("content.xml")) {
>             if (!parsedMetaData) {
>                 // meta.xml not seen yet: defer the content until it has been parsed
>                 delayLoadContent = true;
>             } else {
>                 if (content instanceof OpenDocumentContentParser) {
>                     ((OpenDocumentContentParser) content).parseInternal(zip, handler, metadata, context);
>                 } else {
>                     // Foreign content parser was set:
>                     content.parse(zip, handler, metadata, context);
>                 }
>             }
>         }
>
> works as expected.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
