tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2735) notes and footer contents are duplicated in extracting text from power point slides
Date Thu, 04 Oct 2018 17:33:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16638585#comment-16638585 ]

Tim Allison commented on TIKA-2735:

On the master branch, I just added configuration options that let the user turn off extraction
from notes sections and from master sections.  There are three types of masters: master slide,
master notes, and master handout.  I think one variable should handle all of those.
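
As a rough illustration of the "one variable for all masters" idea, here is a minimal,
self-contained Java sketch. It does not use Tika or POI; the names Options, Source, Block,
shouldExtract and extract are hypothetical and invented for this example, and treating master
notes as master content (rather than as notes) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class SlideTextSketch {

    /** Hypothetical options; these names are illustrative, not Tika's API. */
    static final class Options {
        boolean includeNotes = true;
        // Assumption: a single flag covers slide, notes and handout masters.
        boolean includeMasterContent = true;
    }

    /** Hypothetical kinds of text a presentation can contribute. */
    enum Source { SLIDE, NOTES, MASTER_SLIDE, MASTER_NOTES, MASTER_HANDOUT }

    /** A piece of text tagged with where it came from. */
    static final class Block {
        final Source source;
        final String text;

        Block(Source source, String text) {
            this.source = source;
            this.text = text;
        }
    }

    /** Decides whether text from the given source should be emitted. */
    static boolean shouldExtract(Source source, Options opts) {
        switch (source) {
            case NOTES:
                return opts.includeNotes;
            case MASTER_SLIDE:
            case MASTER_NOTES:
            case MASTER_HANDOUT:
                return opts.includeMasterContent;
            default:
                return true; // slide body text is always kept
        }
    }

    /** Returns the text of every block that the options allow, in input order. */
    static List<String> extract(List<Block> blocks, Options opts) {
        List<String> out = new ArrayList<>();
        for (Block b : blocks) {
            if (shouldExtract(b.source, opts)) {
                out.add(b.text);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Block> blocks = new ArrayList<>();
        blocks.add(new Block(Source.SLIDE, "Title"));
        blocks.add(new Block(Source.NOTES, "Speaker note"));
        blocks.add(new Block(Source.MASTER_SLIDE, "Footer"));

        Options opts = new Options();
        opts.includeMasterContent = false; // drop text from all three master types
        System.out.println(extract(blocks, opts));
    }
}
```

With a single master flag, a user who wants to suppress repeated footer text has one switch to
set instead of three.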

There are some new unit tests that aren't passing, and I can't figure out whether this is user
error, a bug in POI, or an artifact of how the documents were generated.

I also cleaned up and, I think, improved extraction from the notes section in ppt.

IMHO, these changes are too big to make it into 1.19.1, but they should be ok (after
large-scale regression tests) to go into 1.20.

> notes and footer contents are duplicated in extracting text from power point slides
> -----------------------------------------------------------------------------------
>                 Key: TIKA-2735
>                 URL: https://issues.apache.org/jira/browse/TIKA-2735
>             Project: Tika
>          Issue Type: Bug
>          Components: handler
>    Affects Versions: 1.18
>            Reporter: feng ye
>            Priority: Major
>         Attachments: Oneslide.ppt, pptTextResults.txt
> notes and footer contents are duplicated at the end when extract text from ppt slides
> (like the one in the attachment). Both the input file and the text results are attached.
> Is there a configuration option that can be used to suppress this kind of duplication?
