tika-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (TIKA-197) Microsoft Outlook (msg) files get parsed multiple times
Date Sun, 08 Feb 2009 22:28:59 GMT

     [ https://issues.apache.org/jira/browse/TIKA-197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-197.

    Resolution: Fixed
      Assignee: Jukka Zitting

Thanks for reporting this!

This issue was caused by the OfficeParser class using a special pattern for detecting Outlook-specific
entries inside Microsoft's OLE2 container format. Outlook-specific parsing was triggered every time
an internal entry matching the pattern was detected, so a file with several matching entries was
parsed once per entry. Our previous test .msg file contained only one such entry, so we never saw
this issue, but apparently it's possible and even likely for Outlook files to contain multiple
such entries.

I fixed the issue in revision 742187 simply by introducing a special marker flag that prevents
the Outlook extractor from being fired more than once per document being parsed. It's a bit
ugly, but it works. :-)
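
For readers of the archive, here is a minimal sketch of the marker-flag idea described above.
It is not the code committed in revision 742187: the class name, the method name, the entry
names and the regular expression are all assumptions made for illustration, and a simple
counter stands in for the real Outlook extractor.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class OutlookOnceSketch {

    // Hypothetical pattern for Outlook-specific OLE2 entry names.
    // The pattern actually used by OfficeParser is not shown in this message.
    private static final Pattern OUTLOOK_ENTRY = Pattern.compile("__substg1\\.0_.*");

    /**
     * Walks the entry names of one document and returns how many times
     * the (stand-in) Outlook extractor was fired.
     */
    public static int parse(List<String> entryNames) {
        int extractorRuns = 0;
        // Marker flag: local to this call, so it starts as false for every document.
        boolean outlookExtracted = false;
        for (String name : entryNames) {
            if (!outlookExtracted && OUTLOOK_ENTRY.matcher(name).matches()) {
                outlookExtracted = true; // block any further triggers for this document
                extractorRuns++;         // stand-in for invoking the Outlook extractor
            }
        }
        return extractorRuns;
    }

    public static void main(String[] args) {
        // Illustrative entry names; two of them match the hypothetical pattern.
        List<String> entries = Arrays.asList(
                "__substg1.0_0037001E",
                "__substg1.0_1000001E",
                "__properties_version1.0");
        System.out.println("Outlook extractor runs: " + parse(entries));
    }
}
```

Without the flag, the counter in this sketch would increase once per matching entry, which
is the behaviour reported in this issue. With the flag, it stops at one no matter how many
entries match.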

> Microsoft Outlook (msg) files get parsed multiple times
> -------------------------------------------------------
>                 Key: TIKA-197
>                 URL: https://issues.apache.org/jira/browse/TIKA-197
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.3
>            Reporter: kumar raja jana
>            Assignee: Jukka Zitting
>             Fix For: 0.3
>         Attachments: MIME.msg
> Microsoft Outlook (msg) files get parsed around 50 times using TikaGUI

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
