tika-dev mailing list archives

From "Tyler Palsulich (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-579) DcXMLParser: DC metadata text in extracted body
Date Sun, 01 Mar 2015 07:41:04 GMT

     [ https://issues.apache.org/jira/browse/TIKA-579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tyler Palsulich updated TIKA-579:
---------------------------------
    Affects Version/s:     (was: 0.8)
                       1.8

> DcXMLParser: DC metadata text in extracted body
> -----------------------------------------------
>
>                 Key: TIKA-579
>                 URL: https://issues.apache.org/jira/browse/TIKA-579
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.8
>         Environment: N/A
>            Reporter: Scott Severtson
>
> The DcXMLParser correctly extracts Dublin Core metadata text into the Metadata object,
> but the metadata text is also included in the extracted "body".
> Sample XML document:
> ---
> <?xml version="1.0" encoding="UTF-8"?>
> <a xmlns:dc="http://purl.org/dc/elements/1.1/">
> 	<dc:title>This is the title</dc:title>
> 	<dc:creator>Scott Severtson</dc:creator>
> 	<dc:subject>This is the subject</dc:subject>
> 	<b>This is the body text.</b>
> </a>
> ---
> Sample code:
> ---
> URL xmlDocument = ...
> TikaConfig tikaConfig = new TikaConfig();
> ParseUtils.getStringContent(xmlDocument, tikaConfig, "application/xml");
> ---
> Actual output:
> ---
> 	This is the title
> 	Scott Severtson
> 	This is the subject
> 	This is the body text.
> ---
> Expected output:
> ---
> 	This is the body text.
> ---
> The output is the same when using ParseUtils *and* when using DcXMLParser directly
> with a ContentHandler. The ContentHandler receives a single text node containing the
> concatenated metadata and body text, so there is no opportunity to work around this issue
> externally. We would expect DcXMLParser to remove DC nodes from the body before extracting
> the body text, to be more consistent with other Tika parsers.
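
The expected behaviour described above can be illustrated with a minimal sketch that uses only the JDK's SAX API. This is not Tika's code and not a proposed patch: the class DcBodyFilter and its extractBody method are hypothetical names introduced here, and the sketch assumes that skipping every element in the Dublin Core namespace (and anything nested inside one) is what "remove DC nodes from the body" means.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Tika): collects character data
// while skipping everything inside Dublin Core elements.
class DcBodyFilter extends DefaultHandler {
    // Namespace URI taken from the sample document in the report.
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    private final StringBuilder body = new StringBuilder();
    private int dcDepth = 0; // greater than 0 while inside a dc:* element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Enter a DC element, or descend further inside one.
        if (dcDepth > 0 || DC_NS.equals(uri)) {
            dcDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Leave one level of a DC subtree, if we are in one.
        if (dcDepth > 0) {
            dcDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only text outside DC elements counts as body text.
        if (dcDepth == 0) {
            body.append(ch, start, length);
        }
    }

    // Parses the XML string and returns the body text with surrounding whitespace trimmed.
    static String extractBody(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed so startElement receives namespace URIs
            DcBodyFilter handler = new DcBodyFilter();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.body.toString().trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

Calling DcBodyFilter.extractBody on the sample document from the report returns only "This is the body text.", which matches the expected output listed above. The depth counter is used instead of a boolean so that elements nested inside a DC element are skipped as well.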



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
