tika-dev mailing list archives

From "Hanna Sahle (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-2168) Incorrect <a> parsing in PdfParser
Date Mon, 07 Nov 2016 08:16:58 GMT
Hanna Sahle created TIKA-2168:
---------------------------------

             Summary: Incorrect <a> parsing in PdfParser
                 Key: TIKA-2168
                 URL: https://issues.apache.org/jira/browse/TIKA-2168
             Project: Tika
          Issue Type: Bug
          Components: parser, server
    Affects Versions: 1.13
         Environment: Running Tika server 1.13 and testing via http api 
            Reporter: Hanna Sahle


PdfParser returns self-closing tags for {code:xml}<a/>{code} and {code:xml}<p/>{code}.
Self-closing syntax is not valid HTML for these elements, and the output does not render correctly in any browser.

{code:xml}<a href="https://wiki.apache.org/tika/TikaJAXRS"/>{code} in the example below
should be {code:xml}<a href="https://wiki.apache.org/tika/TikaJAXRS"></a>{code}.

We have tested PDFs converted from both Word and Google documents, with the same results. This
is an example of the output we get when parsing a PDF document that contains a link:
 
{code:xml}
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta name="date" content="2016-11-07T07:51:14Z"/>
        <meta name="pdf:PDFVersion" content="1.5"/>
        <meta name="xmp:CreatorTool" content="Microsoft&reg; Word 2016"/>
        <meta name="access_permission:modify_annotations" content="true"/>
        <meta name="access_permission:can_print_degraded" content="true"/>
        <meta name="dcterms:created" content="2016-11-07T07:51:14Z"/>
        <meta name="Last-Modified" content="2016-11-07T07:51:14Z"/>
        <meta name="dcterms:modified" content="2016-11-07T07:51:14Z"/>
        <meta name="dc:format" content="application/pdf; version=1.5"/>
        <meta name="xmpMM:DocumentID" content="uuid:7C86A62C-A4B2-464A-AAEC-5524E170E2AF"/>
        <meta name="Last-Save-Date" content="2016-11-07T07:51:14Z"/>
        <meta name="access_permission:fill_in_form" content="true"/>
        <meta name="meta:save-date" content="2016-11-07T07:51:14Z"/>
        <meta name="pdf:encrypted" content="false"/>
        <meta name="modified" content="2016-11-07T07:51:14Z"/>
        <meta name="Content-Type" content="application/pdf"/>
        <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
        <meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
        <meta name="meta:creation-date" content="2016-11-07T07:51:14Z"/>
        <meta name="created" content="Mon Nov 07 07:51:14 UTC 2016"/>
        <meta name="access_permission:extract_for_accessibility" content="true"/>
        <meta name="access_permission:assemble_document" content="true"/>
        <meta name="xmpTPg:NPages" content="1"/>
        <meta name="Creation-Date" content="2016-11-07T07:51:14Z"/>
        <meta name="access_permission:extract_content" content="true"/>
        <meta name="access_permission:can_print" content="true"/>
        <meta name="producer" content="Microsoft&reg; Word 2016"/>
        <meta name="access_permission:can_modify" content="true"/>
        <title></title>
    </head>
    <body>
        <div class="page">
            <p/>
            <p>This is a word document, converted to pdf.  
</p>
            <p>Example link: https://wiki.apache.org/tika/TikaJAXRS 
</p>
            <p> </p>
            <p/>
            <div class="annotation">
                <a href="https://wiki.apache.org/tika/TikaJAXRS"/>
            </div>
        </div>
    </body>
</html>
{code}
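
Until the parser output changes, a client of the HTTP API could post-process the response itself. The following is only a minimal sketch of that idea, not anything Tika provides: the function name {{expand_self_closing}} and the regular expression are hypothetical, and it assumes that attribute values contain no unescaped {{<}} or {{>}} characters. It rewrites self-closing tags as explicit open/close pairs, leaving HTML void elements such as {{<meta/>}} untouched.

```python
import re

# HTML void elements: the only elements that may appear without an end tag.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "param", "source", "track", "wbr",
}

# Matches a self-closing tag such as <p/> or <a href="..."/>.
# Assumption: attribute values contain no raw '<' or '>' characters.
SELF_CLOSING = re.compile(r"<([A-Za-z][\w:.-]*)((?:\s+[^<>]*?)?)\s*/>")


def expand_self_closing(xhtml: str) -> str:
    """Rewrite <x .../> as <x ...></x> for every non-void element."""

    def fix(match: re.Match) -> str:
        name, attrs = match.group(1), match.group(2)
        if name.lower() in VOID_ELEMENTS:
            return match.group(0)  # <meta .../>, <br/> etc. stay as they are
        return f"<{name}{attrs}></{name}>"

    return SELF_CLOSING.sub(fix, xhtml)


if __name__ == "__main__":
    snippet = '<div class="annotation"><a href="https://wiki.apache.org/tika/TikaJAXRS"/></div>'
    print(expand_self_closing(snippet))
```

A regular expression is used here only to keep the sketch short. A real XML parser combined with an HTML serializer would be more robust.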



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
