tika-dev mailing list archives

From "Andreas Beeker (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1755) Make ppt and pptx paragraph/div breaks more consistent
Date Wed, 30 Sep 2015 20:59:05 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14938860#comment-14938860 ]

Andreas Beeker commented on TIKA-1755:
--------------------------------------

I think the goal would be to modify common sl so that only one Tika parser class is
necessary, one that uses SlideShowFactory and produces the same results for PPT and PPTX.
I already know of a few drawbacks in the current implementation:
- line breaks are part of the text runs in hslf, whereas in xslf they are explicit tokens
- tables are group shapes in hslf, but not in xslf ... but I guess this doesn't matter for
Tika
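
To illustrate the first drawback, here is a minimal sketch of how both break styles could be reduced to the same list of lines before a single parser emits markup. It is not POI or Tika code: BreakNormalizer, Token and both methods are hypothetical names, and the vertical-tab character used for an embedded break is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of POI or Tika. */
final class BreakNormalizer {

    /** Assumption: a break embedded in run text is encoded as a vertical tab. */
    static final char EMBEDDED_BREAK = (char) 0x0B;

    /** Hypothetical token type for the explicit-break style: text or a break marker. */
    static final class Token {
        static final Token BREAK = new Token(null);
        final String text; // null only for BREAK

        private Token(String text) { this.text = text; }

        static Token text(String text) { return new Token(text); }
    }

    /** hslf-like input: breaks are characters inside the run text. */
    static List<String> fromEmbedded(List<String> runs) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String run : runs) {
            for (int i = 0; i < run.length(); i++) {
                char c = run.charAt(i);
                if (c == EMBEDDED_BREAK) {
                    // Close the current line at the embedded break.
                    lines.add(current.toString());
                    current.setLength(0);
                } else {
                    current.append(c);
                }
            }
        }
        lines.add(current.toString());
        return lines;
    }

    /** xslf-like input: breaks are separate tokens between text runs. */
    static List<String> fromTokens(List<Token> tokens) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (Token token : tokens) {
            if (token == Token.BREAK) {
                // Close the current line at the explicit break token.
                lines.add(current.toString());
                current.setLength(0);
            } else {
                current.append(token.text);
            }
        }
        lines.add(current.toString());
        return lines;
    }
}
```

With both inputs reduced to the same list of lines, the markup for PPT and PPTX could then be produced by one shared code path.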

Currently my main goal for POI is to minimize our critical Sonar issues ... if this Tika issue
is important to you, drop me a line and I'll try to adapt this for POI 3.14-beta1 ...

> Make ppt and pptx paragraph/div breaks more consistent
> ------------------------------------------------------
>
>                 Key: TIKA-1755
>                 URL: https://issues.apache.org/jira/browse/TIKA-1755
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: TIKA-1755.patch
>
>
> In working on [~kiwiwings]'s patch for the new handling of PPT/X, I found that our PPT/PPTX
> parsers behave very differently with <p> and <div> breaks, especially now that
> we've applied the upgrades from TIKA-1707.
> I propose adding quite a few more <p> to capture the sentence/bullet level breaks
> in PPTX as we're now doing for PPT.
> There are a handful of other things that we could clean up (table handling) as well.
> Some of these changes may be relevant to this [discussion|http://mail-archives.apache.org/mod_mbox/tika-dev/201306.mbox/%3CCAL8PwkY96_GKJmps6ZXuoe7H7-byvpxJbkTBuy1goKU3sKZMtQ@mail.gmail.com%3E].
> [~shaie], any input?
> Patch and example output to follow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
