tika-dev mailing list archives

From "Andriy Budzinskyy (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1761) Error Parsing PPT (97-2003) files with password protection against modification which were created using Office 2013
Date Mon, 12 Oct 2015 08:24:05 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14952801#comment-14952801 ]

Andriy Budzinskyy commented on TIKA-1761:
-----------------------------------------

Well, I would expect that no password is needed to extract text if the file (doc or ppt)
is only protected against modification.
The thing is that both of my attached files were created with the same protection setting,
but using different MS Office versions.

> Error Parsing PPT (97-2003) files with password protection against modification which were created using Office 2013
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-1761
>                 URL: https://issues.apache.org/jira/browse/TIKA-1761
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.7, 1.10
>            Reporter: Andriy Budzinskyy
>            Assignee: Tim Allison
>         Attachments: test-2007.ppt, test-2013.ppt
>
>
> PPT documents created (or saved) in PowerPoint 97-2003 format and protected with a password
> against modification using Office 2013 fail during text extraction.
> But extraction works fine for PowerPoint 97-2003 format files saved using Office 2007.
> {noformat}
> java -jar tika-app-1.10.jar --text test_2003.ppt
> Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@22b0f5af
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>         at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:185)
>         at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:489)
>         at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:139)
> Caused by: org.apache.poi.hslf.exceptions.EncryptedPowerPointFileException: PowerPoint file is encrypted. The correct password needs to be set via Biff8EncryptionKey.setCurrentUserPassword()
>         at org.apache.poi.hslf.EncryptedSlideShow.<init>(EncryptedSlideShow.java:102)
>         at org.apache.poi.hslf.HSLFSlideShow.read(HSLFSlideShow.java:259)
>         at org.apache.poi.hslf.HSLFSlideShow.buildRecords(HSLFSlideShow.java:250)
>         at org.apache.poi.hslf.HSLFSlideShow.<init>(HSLFSlideShow.java:165)
>         at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:61)
>         at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>         at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>         ... 5 more
> {noformat}
> I've debugged the Tika library and found that it fails on the UserEditAtom.encryptSessionPersistIdRef
> property. This property is empty in files created with Office 2007 and non-empty in files
> created with Office 2013.
> I've defragmented PPT files as described in https://social.msdn.microsoft.com/Forums/en-US/e33189a5-0b00-44b7-b084-f2757e9b7536/powerpoint-binary-file-format-decryption?forum=os_binaryfile
> Is this a bug in Tika or in the POI library?
> Should it be supported per Apache POI [encryption support|https://poi.apache.org/encryption.html]?
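
For anyone who wants to check the UserEditAtom observation above on their own files, here is a rough sketch of how the presence of encryptSessionPersistIdRef could be detected in a raw UserEditAtom record. It is not what POI or Tika actually does. The record type (0x0FF5) and the two permitted record lengths (0x1C without the field, 0x20 with it) are assumptions based on my reading of the [MS-PPT] specification, and the class and method names are hypothetical. Locating the record inside the "PowerPoint Document" stream is left out; the sketch builds synthetic records instead.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class UserEditAtomProbe {
    // Assumed values, taken from the [MS-PPT] description of UserEditAtom.
    static final int RT_USER_EDIT_ATOM = 0x0FF5;
    static final int HEADER_LEN = 8;            // recVer/recInstance (2) + recType (2) + recLen (4)
    static final int LEN_WITHOUT_CRYPT = 0x1C;  // body without the optional field
    static final int LEN_WITH_CRYPT = 0x20;     // body with encryptSessionPersistIdRef

    /** Returns encryptSessionPersistIdRef, or -1 if the optional field is absent. */
    static long encryptSessionPersistIdRef(byte[] record) {
        if (record.length < HEADER_LEN) {
            throw new IllegalArgumentException("record shorter than its header");
        }
        ByteBuffer buf = ByteBuffer.wrap(record).order(ByteOrder.LITTLE_ENDIAN);
        buf.getShort();                           // skip recVer / recInstance
        int recType = buf.getShort() & 0xFFFF;
        long recLen = buf.getInt() & 0xFFFFFFFFL;
        if (recType != RT_USER_EDIT_ATOM) {
            throw new IllegalArgumentException("not a UserEditAtom record");
        }
        if (recLen == LEN_WITHOUT_CRYPT) {
            return -1;                            // field omitted
        }
        if (recLen != LEN_WITH_CRYPT || record.length < HEADER_LEN + LEN_WITH_CRYPT) {
            throw new IllegalArgumentException("unexpected UserEditAtom length");
        }
        // The optional field is the last 4 bytes of the longer record.
        return buf.getInt(HEADER_LEN + LEN_WITHOUT_CRYPT) & 0xFFFFFFFFL;
    }

    /** Builds a synthetic record for the demo; all other fields are left as zero. */
    static byte[] sample(boolean withCrypt, int ref) {
        int len = withCrypt ? LEN_WITH_CRYPT : LEN_WITHOUT_CRYPT;
        ByteBuffer buf = ByteBuffer.allocate(HEADER_LEN + len).order(ByteOrder.LITTLE_ENDIAN);
        buf.putShort((short) 0);
        buf.putShort((short) RT_USER_EDIT_ATOM);
        buf.putInt(len);
        if (withCrypt) {
            buf.putInt(HEADER_LEN + LEN_WITHOUT_CRYPT, ref);
        }
        return buf.array();
    }

    public static void main(String[] args) {
        System.out.println(encryptSessionPersistIdRef(sample(false, 0)));
        System.out.println(encryptSessionPersistIdRef(sample(true, 7)));
    }
}
```

If the assumptions hold, a result of -1 would correspond to the "empty" case seen with the Office 2007 file, and any other value to the Office 2013 case that triggers EncryptedPowerPointFileException.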



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
