tika-dev mailing list archives

From "Sam H (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1841) Different XML output structure for PPT and PPTX
Date Mon, 01 Feb 2016 13:37:39 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15126248#comment-15126248 ]

Sam H commented on TIKA-1841:
-----------------------------

Hi [~gagravarr],

There has been no reaction to this issue in the past 6 days. Can I assume my proposed structure
is ok?

I have already started implementing this:
https://github.com/zetisam/tika/tree/TIKA-1841

The PPT code allows you to get the slide-notes-footer and slide-notes-header separately, but
the POI code seems to add these fields to the output anyway, so I don't know if this is of
much use.

I couldn't find how to do this in PPTX, so maybe this part can be dropped (in order not to
have duplicate content).

The same goes for slide footers in general. They seem to be added to the content already, so
emitting them as a separate div would duplicate that content.

Any thoughts?

> Different XML output structure for PPT and PPTX
> -----------------------------------------------
>
>                 Key: TIKA-1841
>                 URL: https://issues.apache.org/jira/browse/TIKA-1841
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.11
>            Reporter: Sam H
>
> Issue is slightly related to TIKA-1840
> I've noticed that the XML structure of Powerpoint (PPT) and PPTX files is different.

> The structure for PPTX seems as follows:
> {code}
> <div class="slide-content"></div>
> <div class="slide-master-content" />
> <div class="slide-notes"></div> //optional
> <div class="slide-comment"></div> //optional
> ...
> <div class="slide-content"></div>
> <div class="slide-master-content" />
> <div class="slide-notes"></div> //optional
> <div class="slide-comment"></div> //optional
> {code}
> Note that there's no parent slide element to indicate the start and end of each slide.
> For PowerPoint (PPT) the structure is as follows:
> {code}
> <div class="slideShow">
>   <div class="slide">
>     <div class="slide-master-content"></div>
>     <div class="slide-content"></div>
>     <div class="slide-notes"></div> //added in TIKA-1840
>     <div class="slide-comment"></div> 
>   </div>
>   ...
>   <div class="slide">
>     <div class="slide-master-content"></div>
>     <div class="slide-content"></div>
>     <div class="slide-notes"></div> //added in TIKA-1840
>     <div class="slide-comment"></div>
>   </div>
> </div>
> <div class="slide-notes"></div> //list of notes at the end
> {code}
> In my application, I'm using XPath to get the desired information. As the XML structure
> is different, I have to use different XPath queries depending on whether the file is PPT (old)
> or PPTX (new). It would be nice for Tika to return the same XML for both.
> I would propose changing the XML structure to this:
> {code}
> <div class="slideShow">
>   <div class="slide">
>     <div class="slide-master-content"></div>
>     <div class="slide-content"></div>
>     <div class="slide-notes"></div> //added in TIKA-1840
>     <div class="slide-comment"></div> 
>   </div>
>   ...
>   <div class="slide">
>     <div class="slide-master-content"></div>
>     <div class="slide-content"></div>
>     <div class="slide-notes"></div> //added in TIKA-1840
>     <div class="slide-comment"></div>
>   </div>
> </div>
> {code}
> So, essentially, like the current PPT output, but without the list of notes at the end
> (as this is also omitted for PPTX).
> On the one hand this generalizes PPT(X) handling, on the other it can break existing
> (external) functionality relying on a specific XML output format.
> I don't know if this is something the project wants fixed or not. If so, I'm willing
> to donate my time.
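
To illustrate why the two structures need different queries, here is a rough sketch in Python. It is not Tika code: the sample documents are hypothetical, built by hand from the structures quoted in the issue, the `<body>` wrapper and the slide text are made up, and the XHTML namespace that real Tika output carries is left out for brevity. The function names `slides_nested` and `slides_flat` are illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical PPTX-style output: flat siblings, no parent slide element.
PPTX_XML = """<body>
<div class="slide-content">A</div>
<div class="slide-master-content" />
<div class="slide-notes">note A</div>
<div class="slide-content">B</div>
<div class="slide-master-content" />
</body>"""

# Hypothetical PPT-style (and proposed) output: one parent div per slide.
PPT_XML = """<body>
<div class="slideShow">
  <div class="slide">
    <div class="slide-master-content"></div>
    <div class="slide-content">A</div>
    <div class="slide-notes">note A</div>
  </div>
  <div class="slide">
    <div class="slide-master-content"></div>
    <div class="slide-content">B</div>
  </div>
</div>
</body>"""


def slides_nested(root):
    """Collect (content, notes) per slide using the parent slide div."""
    result = []
    for slide in root.findall(".//div[@class='slide']"):
        content = slide.find("div[@class='slide-content']")
        notes = slide.find("div[@class='slide-notes']")
        result.append((
            (content.text or "") if content is not None else "",
            notes.text if notes is not None else None,
        ))
    return result


def slides_flat(root):
    """Without a parent slide div, treat each slide-content as a new slide."""
    result = []
    for div in root.findall("div"):
        cls = div.get("class")
        if cls == "slide-content":
            result.append([div.text or "", None])
        elif cls == "slide-notes" and result:
            # Attach the notes to the most recently started slide.
            result[-1][1] = div.text
    return [tuple(slide) for slide in result]


ppt_slides = slides_nested(ET.fromstring(PPT_XML))
pptx_slides = slides_flat(ET.fromstring(PPTX_XML))
```

With the nested structure, a single path expression per field is enough. The flat structure forces the caller to infer slide boundaries from sibling order, which is the extra per-format logic the proposal would remove.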



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
