tika-dev mailing list archives

From "Brian McColgan (JIRA)" <j...@apache.org>
Subject [jira] [Closed] (TIKA-2588) Tika detecting/parsing pptx with embedded Excel worksheet(s)...
Date Sat, 03 Mar 2018 15:44:00 GMT

     [ https://issues.apache.org/jira/browse/TIKA-2588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brian McColgan closed TIKA-2588.
--------------------------------

Issue resolved very quickly and effectively by the maestro Tika developer T.A.  Thank you once
again, you rock!

> Tika detecting/parsing pptx with embedded Excel worksheet(s)...
> ---------------------------------------------------------------
>
>                 Key: TIKA-2588
>                 URL: https://issues.apache.org/jira/browse/TIKA-2588
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, parser
>    Affects Versions: 1.17
>         Environment:  
>            Reporter: Brian McColgan
>            Assignee: Tim Allison
>            Priority: Major
>             Fix For: 1.18, 2.0.0
>
>         Attachments: foo.out, pptEmbedExcelDoubleClickFromWorkbook.PNG, pptEmbedExcelInEmptyWorkbook.PNG,
tikaSample.pptx
>
>
> Hello tika-developers,
> First, a big 'thank-you' for creating and maintaining Apache-Tika!  A really useful
capability/service that can be used in so many different ways.  You folks are the true Debabelizer
(h2g2.com).
> On to the issue encountered: using Tika 1.17 to extract an embedded Excel object out of a
pptx is causing issues.  A simple example is attached to this Jira issue ([^tikaSample.pptx]);
running it against Tika 1.17 (with verbose/list-parsers/list-detectors) produces the output
in [^foo.out].  The deck contains a title slide and a single slide with an embedded Excel
object on it.
> As noted to [~gagravarr] on Stack Overflow, I grabbed the unit-test data which you use in
your parser/office JUnit suite (test_ppt_embedded_two_slides.pptx) and tried opening it in Office/PPT
2016.  I selected (with the mouse) the embedded sheet (it had the Alfresco logo in it) and pasted it
into an empty Office/Excel 2016 workbook.  When I tried to interact with it, I had to double-click
to make it active.  As a result, I ended up with two Excel instances on my Windows 10 desktop
(the original object in one, the Excel worksheet in the other).  I have included a picture of
the embedded Excel object pasted into the workbook:
> !pptEmbedExcelInEmptyWorkbook.PNG!
> followed by the worksheet opened inside the workbook (this required a double-click within the
black-bordered area in the first picture above):
> !pptEmbedExcelDoubleClickFromWorkbook.PNG!
> I managed to extract the embedded object using Apache POI.  The logic sequence was
something like the following:
>  # Create an XMLSlideShow object, and pull the list of underlying slide entities.
>  # Walk the list of XSLFSlide(s), searching for a matching slide (by name) - e.g. 'MFL'.
>  # Examine each PackagePart of the matching XSLFSlide for its content-type.
>  # If pPart.content-type is 'application/vnd.openxmlformats-officedocument.oleObject'
then - 'candidate FOUND'.
>  # Build POIFS around the candidate FOUND, extract root of FileSystem.
>  # Verify that root has entries for { 'Package', '\u0001Ole', and '\u0001CompObj' }.
>  # Extract entry '\u0001CompObj', verify entry is a DocumentEntry and underlying bytes
for DocumentNode match an 'Excel' signature.
>  # If (step 7 is true) -> extract entry 'Package'.
>  # The resulting entry represents the byte-stream of the embedded Excel entity.
> I was able to instantiate this into a new workbook (as an example) using POI, and when
I opened it, the worksheet was correctly embedded in that 'example.xlsx'.
> I am not as familiar with Tika, so I was a little less comfortable trying to walk through it there.
I thought, however, that recreating this path would provide further insight for you.
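
Steps 6 to 8 of the sequence above can be sketched in plain Java. This is a minimal sketch under stated assumptions, not the reporter's actual code and not the fix that went into Tika: a Map<String, byte[]> stands in for the POIFS root directory so the example runs without POI on the classpath, the class and method names are hypothetical, and treating the 'Excel' signature as the ASCII text "Excel" appearing somewhere in the CompObj bytes is an assumption about what the description means.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper: mirrors steps 6-8 of the description on an in-memory
// stand-in for the OLE2 root directory (entry name -> raw entry bytes).
class EmbeddedExcelCheck {

    // Entry names the description expects in the root of the embedded object.
    static final Set<String> REQUIRED =
            Set.of("Package", "\u0001Ole", "\u0001CompObj");

    // Step 6: the root must contain all three entries.
    static boolean hasRequiredEntries(Set<String> entryNames) {
        return entryNames.containsAll(REQUIRED);
    }

    // Step 7 (assumption): "Excel signature" is taken to mean that the ASCII
    // text "Excel" occurs in the CompObj bytes. ISO-8859-1 maps every byte to
    // one char, so the search is effectively a byte-level substring test.
    static boolean looksLikeExcel(byte[] compObj) {
        if (compObj == null) {
            return false;
        }
        return new String(compObj, StandardCharsets.ISO_8859_1).contains("Excel");
    }

    // Step 8: hand back the 'Package' bytes only when both checks pass.
    static Optional<byte[]> extractPackage(Map<String, byte[]> root) {
        if (!hasRequiredEntries(root.keySet())) {
            return Optional.empty();
        }
        if (!looksLikeExcel(root.get("\u0001CompObj"))) {
            return Optional.empty();
        }
        return Optional.ofNullable(root.get("Package"));
    }
}
```

With POI, the map would presumably be filled from the root directory of a POIFS file system built over the oleObject part, and the returned bytes would be the stream handed to a workbook loader, as in the last step of the list.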



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
