tika-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-1454) Extracting as HTML loses links in xlsx, ppt, and pptx files
Date Sun, 21 May 2017 15:40:11 GMT

     [ https://issues.apache.org/jira/browse/TIKA-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-1454:
    Fix Version/s:     (was: 1.15)

> Extracting as HTML loses links in xlsx, ppt, and pptx files
> -----------------------------------------------------------
>                 Key: TIKA-1454
>                 URL: https://issues.apache.org/jira/browse/TIKA-1454
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.6, 1.7, 1.8, 1.9, 1.10, 1.11, 1.12
>         Environment: RedHat EL5, EL6, EL7
>            Reporter: Chris Bryant
>            Assignee: Tim Allison
>             Fix For: 1.16
>         Attachments: testurl.ods, testurl.xlsx, urltest.odp, urltest.ppt, urltest.pptx
>
> I am trying to convert documents to HTML and then search the HTML for anchor tags
> to find links to external URLs. This works for some document types, including
> PDFs, OpenDocument formats, Microsoft Word .doc and .docx, and the older Microsoft
> Excel .xls format, but it does not work for either Microsoft PowerPoint format
> (.ppt or .pptx) or for the newer Excel .xlsx format. For .ppt, .pptx, and .xlsx files,
> the text is extracted properly and formatted into HTML, but the link is not converted
> to an anchor tag.
> I am running Tika in --server --html mode.
> I included samples of .xlsx, .ppt, and .pptx files that do not extract links properly,
> and samples of .ods and .odp files that do.
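
The check described in the report (scan the HTML output for anchor tags and keep the external URLs) can be sketched roughly as follows. This is only an illustration using Python's standard library, not the reporter's actual code and not part of Tika; the names AnchorCollector and external_links and the two sample HTML strings are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class AnchorCollector(HTMLParser):
    """Collect the href value of every <a> tag seen in the input."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def external_links(html_text):
    """Return the http/https hrefs found in anchor tags, in document order."""
    collector = AnchorCollector()
    collector.feed(html_text)
    collector.close()
    return [u for u in collector.links if urlparse(u).scheme in ("http", "https")]


# Hypothetical samples: the first mimics output where the link became an
# anchor tag, the second mimics the reported symptom (URL present only as text).
with_anchor = '<html><body><p><a href="http://example.com/">http://example.com/</a></p></body></html>'
without_anchor = "<html><body><p>http://example.com/</p></body></html>"

print(external_links(with_anchor))
print(external_links(without_anchor))
```

With input like the second sample, the URL is still present in the extracted text but the function returns an empty list, which matches the symptom reported for .ppt, .pptx, and .xlsx files.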

This message was sent by Atlassian JIRA
